Linear Algebra Notes
Linear Algebra: Notes on the Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Partial Least Squares Regression (PLS) and SIMPLS
1 SVD
Let $X \in \mathbb{R}^{m \times n}$, $m \geq n$, $\operatorname{rank}(X) = r$.
Let $X = U D V^T$ (the thin SVD), with $U \in \mathbb{R}^{m \times n}$, $D \in \mathbb{R}^{n \times n}$, $V \in \mathbb{R}^{n \times n}$.
Let $D = \operatorname{diag}(d_1, \ldots, d_n)$, $d_i \geq 0$.
Let $U^T U = I$, $V^T V = V V^T = I$.
By convention, but not by mathematical necessity, $d_1 \geq d_2 \geq \cdots \geq d_n$.
With $d_{r+1} = \cdots = d_n = 0$ when $X$ is of rank $r < n$.
See Section 6.1: $U$ is column orthonormal and $V$ is orthonormal.
Cols of $U$ are the left singular vectors, cols of $V$ (rows of $V^T$) are the right singular vectors and $d_i$ are the singular values of $X$. The condition number of $X$, $\kappa(X) = d_{\max} / d_{\min}$.
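These definitions can be checked numerically. A minimal sketch using numpy; the matrix, shapes and seed are arbitrary test data, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))    # m = 6, n = 4 (arbitrary test data)

# Thin SVD: U (m x n) column orthonormal, d non-increasing, Vt (n x n) orthonormal.
U, d, Vt = np.linalg.svd(X, full_matrices=False)

assert np.allclose(U.T @ U, np.eye(4))       # Section 6.1
assert np.allclose(Vt @ Vt.T, np.eye(4))
assert np.all(np.diff(d) <= 0)               # conventional ordering
assert np.allclose(U @ np.diag(d) @ Vt, X)   # X = U D V^T

cond = d[0] / d[-1]                          # condition number of X
```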
2 Generalized Inverse
Let $X^+ = V D^{-1} U^T$.
When $X$ is of rank $r < n$, $D$ is singular and $D^{-1}$ does not exist. For computational purposes, it is ok to invert only the non-zero entries of $D$ and leave the zeros intact to get $D^+$, giving the pseudo-inverse $X^+ = V D^+ U^T$.
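A sketch of this recipe in numpy, on an arbitrary rank-deficient test matrix (the tolerance used to decide which singular values are "zero" is an assumption):

```python
import numpy as np

rng = np.random.default_rng(1)
# Rank-2 matrix embedded in 5 x 4 (illustrative test data).
X = rng.standard_normal((5, 2)) @ rng.standard_normal((2, 4))

U, d, Vt = np.linalg.svd(X, full_matrices=False)
tol = 1e-10 * d[0]
# Invert only the non-zero singular values; leave the (near-)zeros intact.
d_plus = np.array([1.0 / s if s > tol else 0.0 for s in d])
X_plus = Vt.T @ np.diag(d_plus) @ U.T        # X^+ = V D^+ U^T

assert d_plus[2] == 0.0 and d_plus[3] == 0.0  # rank 2: last two zeroed
assert np.allclose(X @ X_plus @ X, X)         # Moore-Penrose property
```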
3 Low Rank Approximation
Let $\hat{D} = \operatorname{diag}(\hat d_1, \ldots, \hat d_n)$ with $\hat d_i = d_i$ if $d_i \geq \epsilon$ and $\hat d_i = 0$ otherwise, i.e., singular values less than a threshold $\epsilon$ are set to 0. Then the low rank approximation of $X$ is given by:
$$\hat{X} = U \hat{D} V^T$$
For computational efficiency, $\hat D$ may be replaced with a smaller $r \times r$ version when there are $r$ remaining singular values, and only the first $r$ cols of $U$ and the first $r$ rows of $V^T$ need to be retained. This can also be used for compression and denoising.
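A numerical sketch of the truncation (the rank $r = 3$ stands in for an assumed threshold; the test matrix is arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 6))

U, d, Vt = np.linalg.svd(X, full_matrices=False)
r = 3                                     # assumed threshold keeps the top 3
# Only the first r cols of U and first r rows of V^T are retained.
X_hat = U[:, :r] @ np.diag(d[:r]) @ Vt[:r, :]

# Spectral-norm error of the rank-r truncation is the first dropped
# singular value (Eckart-Young).
assert np.isclose(np.linalg.norm(X - X_hat, 2), d[r])
```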
4 Diagonalizing Covariance
Let $X_o \in \mathbb{R}^{m \times n}$ represent data where each row is an observation and each column is a variable. Let X be $X_o$ in column centered form, i.e., the mean of each column has been subtracted from that column, yielding the covariance $C_X = \frac{1}{m-1} X^T X$. Let $Y = XP$ represent a projected version of $X$. With $X = UDV^T$ and $P = V$:
$$C_Y = \frac{1}{m-1} Y^T Y = \frac{1}{m-1} V^T X^T X V = \frac{1}{m-1} D^2 \tag{4.1}$$
Right multiplication by $V$ diagonalizes the covariance when observations are along rows and the data is column centered.
Let $X_o \in \mathbb{R}^{n \times m}$ represent data where each column is an observation and each row is a variable. Let X be $X_o$ in row centered form, i.e., the mean of each row has been subtracted from that row, yielding the covariance $C_X = \frac{1}{m-1} X X^T$. Let $Y = PX$ represent a projected version of $X$. With $X = UDV^T$ and $P = U^T$:
$$C_Y = \frac{1}{m-1} Y Y^T = \frac{1}{m-1} U^T X X^T U = \frac{1}{m-1} D^2 \tag{4.2}$$
Left multiplication by $U^T$ diagonalizes the covariance when observations are along columns and the data is row centered.
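A numerical check of the rows-as-observations case (arbitrary test data; the columns-as-observations case is symmetric):

```python
import numpy as np

rng = np.random.default_rng(3)
A = rng.standard_normal((100, 5))      # rows = observations
X = A - A.mean(axis=0)                 # column centered

U, d, Vt = np.linalg.svd(X, full_matrices=False)

Y = X @ Vt.T                           # right-multiply by V (equation 4.1)
C = Y.T @ Y / (X.shape[0] - 1)         # covariance of the projected data

assert np.allclose(C, np.diag(np.diag(C)))               # diagonal
assert np.allclose(np.diag(C), d**2 / (X.shape[0] - 1))  # entries = d_i^2/(m-1)
```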
5 PCA
Let X be row centered data, i.e., columns of X are observations. Find a projection $Y = PX$ that "best" expresses the data. Best means: 1) Assume that the variance of the projected data has to be maximized. Further assume that the largest variance is because of the signal and the noise is uncorrelated with the signal. 2) Reduce redundancy. If the projection converts the data to a non-redundant form then each projected variable is independent of other projected variables, so the covariance matrix of Y will be diagonal. Equation 4.2 shows that $P = U^T$ is the required projection matrix. Alternate derivation:
Let $X X^T = E \Lambda E^T$ with eigen values in a diagonal matrix $\Lambda$. Choose $P = E^T$:
$$C_Y = \frac{1}{m-1} Y Y^T = \frac{1}{m-1} E^T X X^T E = \frac{1}{m-1} E^T E \Lambda E^T E = \frac{1}{m-1} \Lambda$$
So the covariance matrix can be factored into a product with a diagonal matrix at the center.
So this choice of P diagonalizes the covariance matrix. The rows of $P$ (cols of $E$) are the principal components of $X$. By comparison with equation 4.2, $E = U$ and $\Lambda = D^2$. Similarly, when the data is column centered and the observations are along the rows, right multiplication by $V$ is the appropriate projection according to equation 4.1.
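Both derivations can be compared numerically. A sketch on arbitrary row-centered test data:

```python
import numpy as np

rng = np.random.default_rng(4)
A = rng.standard_normal((5, 200))          # columns = observations
X = A - A.mean(axis=1, keepdims=True)      # row centered

# Eigen route: X X^T = E Lambda E^T, projection P = E^T.
lam, E = np.linalg.eigh(X @ X.T)
order = np.argsort(lam)[::-1]              # eigh returns ascending order
lam, E = lam[order], E[:, order]

# SVD route: X X^T = U D^2 U^T, so E = U (up to sign) and Lambda = D^2.
U, d, Vt = np.linalg.svd(X, full_matrices=False)
assert np.allclose(lam, d**2)

Y = E.T @ X                                # projected data
C = Y @ Y.T / (X.shape[1] - 1)
assert np.allclose(C, np.diag(np.diag(C)))  # diagonal covariance
```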
6 Linear Algebra Preliminaries
6.1 Column Orthonormal Matrix
If $U \in \mathbb{R}^{m \times n}$ is column orthonormal, $U^T U = I_n$.
Therefore $U^T$ is a left inverse of $U$. When $m < n$, $U$ cannot be column orthonormal. Since $\operatorname{rank}(U U^T) \leq \operatorname{rank}(U) \leq \min(m, n)$ (and $\operatorname{rank}(I_m) = m$), $U U^T = I_m$ can be true only when $m \leq n$. Thus $U U^T = I$ only when $U$ is column orthonormal and square. Otherwise it can be extended to a square orthonormal matrix by padding on additional columns of orthonormal vectors.
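A quick numerical sketch of both facts (arbitrary test matrices; the padding here uses QR, one possible way to generate the extra orthonormal columns):

```python
import numpy as np

rng = np.random.default_rng(5)
U = np.linalg.qr(rng.standard_normal((5, 3)))[0]   # tall column orthonormal

assert np.allclose(U.T @ U, np.eye(3))        # left product is identity
assert not np.allclose(U @ U.T, np.eye(5))    # rank(U U^T) <= 3 < 5

# Pad with extra columns and re-orthonormalize to get a square
# orthonormal matrix; QR leaves the original columns intact up to sign.
M = np.hstack([U, rng.standard_normal((5, 2))])
Q = np.linalg.qr(M)[0]
assert np.allclose(Q.T @ Q, np.eye(5))
assert np.allclose(np.abs(Q[:, :3]), np.abs(U))
```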
6.2 Properties of Inverse
Since $A A^{-1} = I$, transposing gives $(A^{-1})^T A^T = I$, so:
$$(A^{-1})^T = (A^T)^{-1} \tag{6.1}$$
Let $A$ be symmetric, $A = A^T$. Equation 6.1 gives: $(A^{-1})^T = (A^T)^{-1} = A^{-1}$.
When $A$ is symmetric, $A^{-1}$, if it exists, is also symmetric.
6.3 Eigen Vectors of a Symmetric Matrix
Eigen vectors corresponding to distinct eigen values of a symmetric matrix are orthogonal.
Let $A = A^T$, $A v_1 = \lambda_1 v_1$, $A v_2 = \lambda_2 v_2$, $\lambda_1 \neq \lambda_2$. Then:
$$\lambda_1 v_2^T v_1 = v_2^T A v_1 = v_2^T A^T v_1 = (A v_2)^T v_1 = \lambda_2 v_2^T v_1$$
Since $\lambda_1 \neq \lambda_2$, $v_2^T v_1 = 0$.
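A one-line numerical check (a random symmetric matrix almost surely has distinct eigenvalues):

```python
import numpy as np

rng = np.random.default_rng(7)
B = rng.standard_normal((4, 4))
A = B + B.T                        # symmetric; eigenvalues distinct a.s.
lam, V = np.linalg.eigh(A)

# Columns of V are eigenvectors; for distinct eigenvalues they are orthogonal.
assert np.allclose(V.T @ V, np.eye(4))
assert np.allclose(A @ V, V @ np.diag(lam))
```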
6.4 Subtracting out components to make residual orthogonal
To subtract all components of vector $x$ along the direction $u$ ($\|u\| = 1$), leaving the residual orthogonal to $u$:
$$r = x - u (u^T x) \tag{6.2}$$
Proof: $u^T r = u^T x - u^T u (u^T x) = u^T x - u^T x = 0$.
Let $X \in \mathbb{R}^{m \times n}$, $u \in \mathbb{R}^n$, $\|u\| = 1$, so $X$ contains one observation on each row. To orthogonalize the residual along the rows: $R = X - X u u^T$.
Here $Xu$ gives the components of each row of $X$ along $u$, transposing $u$ flips it for subtraction along rows of $X$ and the product $(Xu) u^T$ gives scaled versions of $u^T$ for subtraction along the rows of $X$.
Proof: $Ru = Xu - X u u^T u = Xu - Xu = 0$.
Let $X \in \mathbb{R}^{n \times m}$, $u \in \mathbb{R}^n$, $\|u\| = 1$, so $X$ contains one observation on each column. To orthogonalize the residual along the columns: $R = X - u u^T X$.
Proof: Let $x$ be a column of $X$. Equation 6.2 gives: $r = x - u (u^T x)$ with $u^T r = 0$, and $R$ stacks these residuals column by column.
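The vector, row-wise and column-wise forms can all be sketched in a few lines of numpy (arbitrary test data):

```python
import numpy as np

rng = np.random.default_rng(8)
x = rng.standard_normal(5)
u = rng.standard_normal(5)
u /= np.linalg.norm(u)           # unit direction

r = x - u * (u @ x)              # equation 6.2
assert np.isclose(u @ r, 0.0)    # residual orthogonal to u

# Row-wise: R = X - X u u^T removes the u-component of each row.
X = rng.standard_normal((6, 5))
R = X - np.outer(X @ u, u)
assert np.allclose(R @ u, 0.0)

# Column-wise: R = X - u u^T X removes the u-component of each column.
Xc = rng.standard_normal((5, 6))
Rc = Xc - np.outer(u, u @ Xc)
assert np.allclose(u @ Rc, 0.0)
```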
7 PLS Regression
For PCR, the Y values are an afterthought. First explain the structure of X using principal components, then regress from projected X to Y. PLS considers X and Y simultaneously by extracting their shared structure. It is assumed that $X$ and $Y$ are already column centered.
$$X = T P^T + E, \qquad Y = T Q^T + F \tag{7.1}$$
$E$ and $F$ are the residuals, the columns of $T$ are the "latent components" and $P$ and $Q$ are the "loadings". Let $\operatorname{unit}(x)$ stand for $x / \|x\|$, an operator that extracts a unit vector.
7.1 NIPALS (Iterative PLS)
Let $X \in \mathbb{R}^{n \times p}$, $Y \in \mathbb{R}^{n \times q}$. $w$ is a unit vector along which $X$ is projected, $c$ is a unit vector along which $Y$ is projected, $t = \operatorname{unit}(Xw)$ is the unit vector of the projected $X$, $u = Yc$ is the projected $Y$. Initialize $u$ to the first column of $Y$ (or set $u$ to a random unit vector in step 1 below). The goal is to maximize $t^T u$ for each extracted latent component. Repeat to convergence to extract the latent component:
1. $w = \operatorname{unit}(X^T u)$
2. $t = \operatorname{unit}(X w)$
3. $c = \operatorname{unit}(Y^T t)$
4. $u = Y c$
Let $p = X^T t$, $q = Y^T t$.
Deflate: $X \leftarrow X - t t^T X = X - t p^T$, $Y \leftarrow Y - t t^T Y = Y - t q^T$. So the residual is formed by removing everything in the direction of $t$ from the columns of $X$. Deflation of $Y$ is similar except factors of $t$ scaled up to the dot product of $t$ and the corresponding column of $Y$ are removed. Deflation assures that $t_2^T t_1 = 0$. Proof: $t_2 \propto X_2 w_2$, so $t_1^T t_2 \propto t_1^T (X_1 - t_1 t_1^T X_1) w_2 = (t_1^T X_1 - t_1^T X_1) w_2 = 0$ since $t_1^T t_1 = 1$.
After extracting $k$ latent components, when the residual is small, 7.1 gives:
$$X \approx T P^T, \quad Y \approx T Q^T \implies T \approx X (P^T)^+, \quad \hat{Y} = X B, \quad B = (P^T)^+ Q^T$$
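A minimal numpy sketch of one NIPALS component plus the deflation step (shapes, seed and noise level are illustrative assumptions):

```python
import numpy as np

def unit(v):
    return v / np.linalg.norm(v)

rng = np.random.default_rng(9)
X = rng.standard_normal((50, 6))
X -= X.mean(axis=0)
Y = X @ rng.standard_normal((6, 3)) + 0.1 * rng.standard_normal((50, 3))
Y -= Y.mean(axis=0)

u = Y[:, 0].copy()               # initialize u to the first column of Y
for _ in range(200):             # iterate steps 1-4 to convergence
    w = unit(X.T @ u)            # step 1
    t = unit(X @ w)              # step 2
    c = unit(Y.T @ t)            # step 3
    u = Y @ c                    # step 4

p = X.T @ t                      # loading for X
q = Y.T @ t                      # loading for Y
X1 = X - np.outer(t, p)          # deflate X: remove the t direction
Y1 = Y - np.outer(t, q)          # deflate Y

assert np.isclose(np.linalg.norm(t), 1.0)
assert np.allclose(t @ X1, 0.0)  # residual orthogonal to t
assert np.allclose(t @ Y1, 0.0)
```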
7.2 NIPALS: Theory of Operation
Start in step 4 and assume we already know $t$. The algorithm needs to maximize $t^T u = t^T Y c$. This is maximized when $c$ is the unit vector along $Y^T t$, justifying step 3. Now assume $u$ is known. The algorithm needs to maximize $t^T u \propto w^T X^T u$. This is maximized when $w$ is the unit vector along $X^T u$, justifying step 1.
Starting in step 1 and working backwards: $w_{new} \propto X^T u = X^T Y c \propto X^T Y Y^T t \propto X^T Y Y^T X w$. This is called power-iteration. Starting with a random unit vector and iterating will on convergence result in:
$$w = \text{the dominant eigen vector of } X^T Y Y^T X$$
Proving properties of this algorithm is difficult (but possible) because after extracting the first latent component the algorithm operates on residuals obtained through deflation, which obfuscates the connection to the original $X$ and $Y$. Knowing that the iteration converges on dominant eigen vectors, we can formulate an SVD based version named SIMPLS. This is identical in the first latent factor, but diverges from NIPALS for later factors because of a different deflation scheme. Once SIMPLS is understood it can also be implemented using power-iteration instead of SVD if necessary. Other advantages of SIMPLS: a) The pseudo-inverse need not be calculated. b) For large data sets repeated deflation of $X$ and $Y$ can be problematic. In SIMPLS only the covariance $S = X^T Y$, which is smaller, is deflated. c) Convenient to find the objective function since all relevant vectors can be directly expressed in terms of $X$ and $Y$, unlike NIPALS which uses residuals $E$ and $F$.
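The power-iteration claim can be verified directly: composing NIPALS steps 4, 3, 1 maps $w$ to $X^T Y Y^T X w$ up to scale. A sketch on arbitrary centered test data:

```python
import numpy as np

rng = np.random.default_rng(10)
X = rng.standard_normal((40, 5))
X -= X.mean(axis=0)
Y = rng.standard_normal((40, 2))
Y -= Y.mean(axis=0)

# One pass of the NIPALS loop body applied to w (unnormalized intermediate
# steps) is exactly multiplication by M = X^T Y Y^T X.
M = X.T @ Y @ Y.T @ X

w = rng.standard_normal(5)
w /= np.linalg.norm(w)
for _ in range(2000):
    w = X.T @ (Y @ (Y.T @ (X @ w)))   # steps 1 <- 4 <- 3 composed
    w /= np.linalg.norm(w)

lam, V = np.linalg.eigh(M)
w_dom = V[:, -1]                      # dominant eigen vector of M
assert abs(w @ w_dom) > 0.99          # converged up to sign
```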
8 SIMPLS (Binu's Version)
# Inputs: X, Y both column centered.
# rcond = Parameter to decide when to stop extracting latent components
S = X.T @ Y
k = number of columns of X
for i in range(k):
    U1, D1, VT1 = svd(S)
    # Columns of U1 = Eigen vectors of S @ S.T
    # Rows of VT1 = Eigen vectors of S.T @ S
    w = first column of U1
    c = first row of VT1
    lam = D1[0]**2  # Largest singular value squared = eigenvalue of w, c
    # Use first eigen value to set a threshold. When we fall below
    # the threshold, stop extracting latent components
    if i == 0:
        threshold = rcond * lam
    else:
        if lam < threshold:
            k = i
            break
    t = X @ w                    # Project X
    t_norm = norm(t)
    t = t / t_norm               # Keep t as a unit vector
    w = w / t_norm               # Re-scale w to ensure t = X @ w
    p = X.T @ t
    q = Y.T @ t
    v = p
    if i > 0:
        v = v - V @ (V.T @ p)    # Deflate wrt previous vectors in V
    v = v / norm(v)              # Convert v to a unit vector
    S = S - v @ (v.T @ S)        # Deflate the covariance
    Append w, t, q, v to W, T, Q, V as columns respectively
    # Optional: append p to columns of P if we need to reconstruct X
B = W @ Q.T                      # Regression matrix.
return B, W, T, Q, V
As with NIPALS:
$$\hat{Y} = X B = T Q^T, \quad B = W Q^T$$
By construction the following are ensured:
$$t_i = X w_i, \quad t_i^T t_i = 1, \quad t_i^T t_j = 0 \ (i \neq j), \quad q_i = Y^T t_i$$
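The loop above can be sketched as a runnable numpy function. This is a simplification: the rcond stopping test is replaced by a fixed number of components, and the test data (with Y exactly linear in X) is an assumption chosen so that the full-rank fit is exact:

```python
import numpy as np

def simpls(X, Y, n_components):
    # Sketch of the SIMPLS loop (X, Y column centered); the rcond-based
    # early stopping from the pseudocode is omitted here.
    S = X.T @ Y                        # covariance matrix
    W, T, Q, V = [], [], [], []
    for i in range(n_components):
        U1, D1, VT1 = np.linalg.svd(S, full_matrices=False)
        w = U1[:, 0]
        t = X @ w                      # project X
        t_norm = np.linalg.norm(t)
        t = t / t_norm                 # keep t a unit vector
        w = w / t_norm                 # re-scale w so t = X @ w still holds
        p = X.T @ t
        q = Y.T @ t
        v = p.copy()
        for v_prev in V:               # deflate wrt previous deflation vectors
            v = v - (v_prev @ v) * v_prev
        v = v / np.linalg.norm(v)      # unit deflation vector
        S = S - np.outer(v, v @ S)     # deflate the covariance
        W.append(w)
        T.append(t)
        Q.append(q)
        V.append(v)
    Wm = np.column_stack(W)
    Tm = np.column_stack(T)
    Qm = np.column_stack(Q)
    return Wm @ Qm.T, Wm, Tm, Qm       # B = W Q^T, so Yhat = X B

rng = np.random.default_rng(11)
X = rng.standard_normal((60, 5))
X -= X.mean(axis=0)
Y = X @ rng.standard_normal((5, 5))    # Y exactly linear in X for the check
Y -= Y.mean(axis=0)

B, W, T, Q = simpls(X, Y, 5)
assert np.allclose(T.T @ T, np.eye(5))   # scores orthonormal by construction
assert np.allclose(X @ B, Y)             # full-rank fit recovers Y exactly
```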
8.1 SIMPLS: Theory of Operation
SIMPLS works by deflating the covariance matrix $S = X^T Y$ instead of deflating the residuals $E$, $F$ like in NIPALS. Let $S_i$, $V_i$ represent the values of the respective matrices on entry to the loop iteration $i$. On exit from the loop these are updated for the next iteration. The vectors $w_i, t_i, p_i, q_i, v_i$ are computed during loop iteration $i$.
Let $S_i$ = residual covariance matrix, $S_0 = X^T Y$.
Let, $S_i = U_1 D_1 V_1^T$. Then columns of $U_1$ = eigen vectors of $S_i S_i^T$. Rows of $V_1^T$ = eigen vectors of $S_i^T S_i$.
Two interpretations of this SVD are possible.
- As with NIPALS, on the first iteration $w$ and $c$ are eigen vectors of $X^T Y Y^T X$ and $Y^T X X^T Y$ respectively. This does not hold after the first extraction.
- $d_1 w c^T$ is the largest component/rank 1 approximation of the residual covariance. So the algorithm is attempting to deconstruct the covariance matrix. We cannot subtract out $d_1 w c^T$ from the covariance though, because there is some regression coefficient that goes from projected X to projected Y; subtracting out the full rank one term would ignore that coefficient. So we need to further deconstruct this and also decide on a deflation strategy for $S$.
By constraint equation 8.6, $t$ must be a unit vector, so we ensure this by doing $t = \operatorname{unit}(Xw)$. To satisfy constraint equation 8.5, we re-scale $w$ to $w / \|Xw\|$ so that $t = Xw$ continues to hold. When residuals are ignored, $X = T P^T$ and $Y = T Q^T$. Equations 8.3, 8.4 and 8.6 give:
$$S_0 = X^T Y = P T^T T Q^T = P Q^T$$
So the columns of $P$ and $Q$ should consist of $p_i = X^T t_i$ and $q_i = Y^T t_i$ respectively. The residual vanishes only when $X = T P^T$ and $Y = T Q^T$ exactly, which happens when $T$ has accumulated enough components to become square with $T T^T = I$. Until then only the constraint $T^T T = I$ is ensured by construction of the $t_i$. Having extracted one latent component, we need to deflate $S$ by some vector $v$ and then we can move on to the next iteration.
8.2 Choice of Deflation Vector
Let $v$ be some unit vector used to deflate the columns of $S$. We will derive the right choice of $v$ to ensure $t_i^T t_j = 0$ for $i \neq j$.
Let the columns of $V$ be the past deflation vectors. The most general scheme is to first deflate the candidate vector with respect to the prior vectors to ensure $V^T v = 0$, then use the deflated version of $v$ to deflate $S$.
Let $\tilde{v} = v - V V^T v$, so $V^T \tilde{v} = 0$ by construction.
Append the unit vector of $\tilde{v}$ to $V$ to generate $V_{i+1} = [V \ \operatorname{unit}(\tilde v)]$.
This ensures that $V_{i+1}^T S_{i+1} = 0$ since $\operatorname{unit}(\tilde v)^T (S_i - \tilde v \tilde v^T S_i / \|\tilde v\|^2) = 0$, $V^T S_i = 0$, $V^T \tilde v = 0$. The algorithm incrementally deflates $S$, but because the columns of $V$ are orthogonal to each other we can also do this directly from the original $S_0$: $S_{i+1} = (I - V_{i+1} V_{i+1}^T) S_0$.
Because $w$ is an eigen vector of $S S^T$: $S S^T w = \lambda w$.
$w$ is a scaled version of $S (S^T w)$. So $w = S z$ for some vector $z$, i.e., $w$ lies in the column space of the deflated $S$.
To ensure constraint equation 8.6 is satisfied we need $t_{new}^T t_{old} = w_{new}^T X^T t_{old} = w_{new}^T p_{old} = 0$, so that each latent component is orthogonal to previous components. If it were the case that $p_{old} \in \operatorname{span}(V)$, then equation 8.9 becomes: $w_{new}^T p_{old} = z^T S_{new}^T p_{old} = 0$ since $S_{new}^T V = 0$.
So the choice of $v = p$, or equivalently letting $V$ span $\{p_1, \ldots, p_i\}$, ensures the required constraint. Therefore the right choice of deflation vector is: $v = \operatorname{unit}(p - V V^T p)$.