Linear Algebra: Notes on the Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Partial Least Squares Regression (PLS) and SIMPLS

SVD

Let $X_{m \times n}$, $V = [\,v_1 \mid v_2 \mid \cdots \mid v_r\,]_{n \times r}$, $|v_i| = 1$
Let $w_i = X v_i$, $\sigma_i = |w_i|$
Let $u_i = \hat{w}_i = \frac{w_i}{|w_i|} = \frac{w_i}{\sigma_i}$
Let $U = [\,u_1 \mid u_2 \mid \cdots \mid u_r\,]_{m \times r}$, $D = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)_{r \times r}$
By convention, but not by mathematical necessity, $\sigma_i \geq \sigma_j$ for $i < j$
$X v_i = w_i = u_i \sigma_i$
$XV = UD$ (1.1)
With $V_{n \times r} = \operatorname{Eig}(X^T X)$ when $X^T X$ is of rank $r$:
$X^T X v_i = \lambda_i v_i$
$v_i \cdot v_j = \delta_{ij}$
$V^T V = I_{r \times r}$
See Section 6.1: $V V^T = I_{n \times n}$ (1.2)
$w_i \cdot w_i = w_i^T w_i = (X v_i)^T (X v_i) = v_i^T (X^T X v_i) = \lambda_i v_i^T v_i = \lambda_i$
$\sigma_i = |w_i| = \sqrt{\lambda_i}$
$u_i = \hat{w}_i = \frac{w_i}{|w_i|} = \frac{X v_i}{\sqrt{\lambda_i}}$
$u_i \cdot u_j = u_i^T u_j = \frac{1}{\sqrt{\lambda_i \lambda_j}} (X v_i)^T (X v_j) = \frac{1}{\sqrt{\lambda_i \lambda_j}} v_i^T (X^T X v_j) = \frac{1}{\sqrt{\lambda_i \lambda_j}} v_i^T \lambda_j v_j = \sqrt{\frac{\lambda_j}{\lambda_i}}\, v_i^T v_j = \delta_{ij}$
$U^T U = I_{r \times r}$
See Section 6.1: $U U^T = I_{m \times m}$
Equations 1.1, 1.2 give: $XV = UD$
$XV V^T = X I_{n \times n} = UD V^T$
The SVD is: $X = UD V^T$ (1.3)
Columns of $U$ are the left singular vectors, columns of $V$ (rows of $V^T$) are the right singular vectors, and $\operatorname{diag}(D)$ are the singular values of $X$. The condition number of $X$ is $\kappa(X) = \frac{\sigma_1}{\sigma_r}$.
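A minimal numpy sketch of this construction, assuming a random full-column-rank $X$ (so $r = n$); the seed and dimensions are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))            # m x n, assumed full column rank

lam, V = np.linalg.eigh(X.T @ X)           # eigen values of X^T X, ascending
lam, V = lam[::-1], V[:, ::-1]             # reorder so sigma_1 >= ... >= sigma_r
sigma = np.sqrt(lam)                       # sigma_i = sqrt(lambda_i)
U = (X @ V) / sigma                        # u_i = X v_i / sigma_i, broadcast over columns
D = np.diag(sigma)

assert np.allclose(X @ V, U @ D)           # eq. 1.1
assert np.allclose(X, U @ D @ V.T)         # eq. 1.3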

Generalized Inverse

Let $D^{-1} = \operatorname{diag}\!\left(\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_r}\right)_{r \times r}$
$D D^{-1} = D^{-1} D = I_{r \times r}$ (2.1)
Equations 1.3, 2.1 give:
$X = UD V^T$
$X (V D^{-1} U^T) = UD V^T V D^{-1} U^T = U D D^{-1} U^T = U U^T = I_{m \times m}$
$X^{-1}_{\text{right}} = V D^{-1} U^T$
$(V D^{-1} U^T) X = V D^{-1} U^T U D V^T = V D^{-1} D V^T = V V^T = I_{n \times n}$
$X^{-1}_{\text{left}} = V D^{-1} U^T$
$X^{-1} = V D^{-1} U^T$
When $D$ is of rank $r < \min(m, n)$, for computational purposes it is acceptable to extend $D$ to a larger square matrix, invert only the non-zero entries, and leave the zeros intact to get $D^{-1}$ (i.e., the pseudo-inverse of $D$).
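A short numpy sketch of the generalized inverse, assuming a tall full-rank $X$ so only the left inverse exists; the comparison with np.linalg.pinv is a sanity check:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))                   # m x n with m > n: expect a left inverse

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # economy SVD: U is m x r, Vt is r x n
D_inv = np.diag(1.0 / s)                          # invert only the non-zero singular values
X_pinv = Vt.T @ D_inv @ U.T                       # V D^{-1} U^T

assert np.allclose(X_pinv @ X, np.eye(3))         # left inverse: (V D^{-1} U^T) X = I_n
assert np.allclose(X_pinv, np.linalg.pinv(X))     # matches numpy's pseudo-inverse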

Low Rank Approximation

Let $\hat{D}_{ij} = D_{ij}\, \mathbb{1}[D_{ij} \geq \tau]$, i.e., singular values less than a threshold $\tau$ are set to 0. Then the low rank approximation of $X$ is given by:
$\hat{X} = \hat{U} \hat{D} \hat{V}^T$
For computational efficiency, $\hat{D}$ may be replaced with a smaller $k \times k$ version when there are $k$ remaining singular values; only the first $k$ columns of $\hat{U}$ and the first $k$ rows of $\hat{V}^T$ need to be retained. This can also be used for compression and denoising.
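A numpy sketch of the thresholded low rank approximation; the threshold $\tau$ here is an arbitrary illustrative value:

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
tau = 0.5 * s[0]                     # illustrative threshold on the singular values
k = int(np.sum(s >= tau))            # number of retained singular values

# Keep the first k columns of U, first k singular values, first k rows of V^T
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Spectral norm of the error equals the first discarded singular value
print(k, np.linalg.norm(X - X_hat, 2), s[k] if k < len(s) else 0.0)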

Diagonalizing Covariance

Let $X$ represent data where each row is an observation and each column is a variable. Let $X$ be in column centered form, i.e., the mean of each column has been subtracted from that column yielding $E[X_{:,j}] = 0$. Let $Y = XV$ represent a projected version of $X$.
$XV = (UD V^T) V = UD$
$(m-1) \operatorname{cov}(Y) = Y^T Y = (UD)^T UD = D^T U^T U D = DD$ (4.1)
$Y = XV$ diagonalizes the covariance when observations are along rows and the data is column centered.
Let $X$ represent data where each column is an observation and each row is a variable. Let $X$ be in row centered form, i.e., the mean of each row has been subtracted from that row yielding $E[X_{i,:}] = 0$. Let $Y = U^T X$ represent a projected version of $X$.
$U^T X = U^T (UD V^T) = D V^T$
$(n-1) \operatorname{cov}(Y) = Y Y^T = D V^T (D V^T)^T = D V^T V D^T = DD$ (4.2)
$Y = U^T X$ diagonalizes the covariance when observations are along columns and the data is row centered.
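A quick numpy check of equation 4.1 with column centered data (the row centered case of 4.2 is analogous); the dimensions and seed are arbitrary:

import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 4))
X = X - X.mean(axis=0)                 # column centered: rows are observations

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = X @ Vt.T                           # projected data, Y = XV

C = Y.T @ Y                            # (m-1) cov(Y), eq. 4.1
assert np.allclose(C, np.diag(s ** 2)) # DD: diagonal with sigma_i^2 on the diagonal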

PCA

Let $X$ be row centered data, i.e., columns of $X$ are observations. Find a projection $Y = PX$ that "best" expresses the data. Best means:
  1. Assume that $\mathrm{SNR} = \sigma^2_{\text{signal}} / \sigma^2_{\text{noise}}$ has to be maximized. Further assume that the largest variance is because of the signal and that the noise is uncorrelated with the signal.
  2. Reduce redundancy. If the projection converts the data to a non-redundant form then each projected variable is independent of the other projected variables, so the covariance matrix of $Y$ will be diagonal.
Equation 4.2 shows that $P = U^T$ is the required projection matrix. Alternate derivation:
$(n-1) \operatorname{cov}(Y) = Y Y^T = PX (PX)^T = PX X^T P^T$ (5.1)
Let $E = \operatorname{Eig}(X X^T)$ with eigen values in a diagonal matrix $\lambda$
$X X^T E = E \lambda$
$X X^T E E^T = X X^T = E \lambda E^T$ (5.2)
So the covariance matrix can be factored into a product with a diagonal matrix at the center.
With $P = E^T$, equations 5.1, 5.2 give: $PX X^T P^T = E^T E \lambda E^T E = \lambda$
So this choice of $P$ diagonalizes the covariance matrix. The rows of $P$ (columns of $E$) are the principal components of $X$. By comparison with equation 4.2, $E = \operatorname{Eig}(X X^T) = U$. Similarly, when the data is column centered and the observations are along the rows, right multiplication by $V$ is the appropriate projection according to equation 4.1.
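A numpy sketch of PCA as described above, assuming row centered data with observations in the columns; it also confirms that the eigen vectors of $X X^T$ match $U$ from the SVD up to sign:

import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 200))          # variables x observations
X = X - X.mean(axis=1, keepdims=True)      # row centered

U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U.T                                    # rows of P are the principal components
Y = P @ X                                  # projected data

C = Y @ Y.T                                # (n-1) cov(Y), eq. 4.2 / 5.1
assert np.allclose(C, np.diag(s ** 2))     # covariance is diagonalized

# Same components (up to sign) from the eigen decomposition of X X^T
lam, E = np.linalg.eigh(X @ X.T)
E = E[:, ::-1]                             # descending eigen value order
assert np.allclose(np.abs(E), np.abs(U))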

Linear Algebra Preliminaries

6.1 Column Orthonormal Matrix

If $V_{n \times r}$ is column orthonormal, $V^T V = I_{r \times r}$.
$\operatorname{rank}(I_{r \times r}) = r$
$\operatorname{rank}(V^T V) = \operatorname{rank}(V V^T) = \operatorname{rank}(V) \leq \min(n, r)$
Since $\operatorname{rank}(V^T V) = r$, therefore $n \geq r$; when $n < r$, $V$ cannot be column orthonormal. Since $V V^T$ is $n \times n$ and $\operatorname{rank}(I_{n \times n}) = n$, $V V^T = I_{n \times n}$ can be true only when $n = r$. Thus $V^T V = V V^T = I$ only when $V$ is column orthonormal and square. Otherwise it can be made square by padding on $(n - r)$ additional orthonormal columns.

6.2 Properties of Inverse

$A A^{-1} = I$
$(A A^{-1})^T = (A^{-1})^T A^T = I^T = I$
$(A^T)^{-1} = (A^{-1})^T$ (6.1)
Let $A$ be symmetric, $A^T = A$. Equation 6.1 gives:
$(A^T)^{-1} = A^{-1} = (A^{-1})^T$
When $A$ is symmetric, $A^{-1}$, if it exists, is also symmetric.

6.3 Eigen Vectors of a Symmetric Matrix

Eigen vectors corresponding to distinct eigen values of a symmetric matrix are orthogonal.
Let $A v_i = \lambda_i v_i$, $A v_j = \lambda_j v_j$, $\lambda_i \neq \lambda_j$, $A = A^T$
$v_i^T A v_j = v_i^T \lambda_j v_j = \lambda_j v_i^T v_j$
$v_i^T A v_j = (A^T v_i)^T v_j = (A v_i)^T v_j = \lambda_i v_i^T v_j$
$\lambda_i v_i^T v_j = \lambda_j v_i^T v_j$
$(\lambda_i - \lambda_j) v_i^T v_j = 0$
Since $\lambda_i \neq \lambda_j$, $v_i^T v_j = 0$
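A quick numpy illustration, using an arbitrary random symmetric matrix:

import numpy as np

rng = np.random.default_rng(9)
M = rng.standard_normal((4, 4))
A = M + M.T                                  # a random symmetric matrix

lam, V = np.linalg.eigh(A)                   # eigen vectors in the columns of V
assert np.allclose(V.T @ V, np.eye(4))       # eigen vectors are orthonormal (eigen values distinct here)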

6.4 Subtracting out components to make residual orthogonal

To subtract all components of vector $a$ along the direction $\hat{b}$, leaving the residual orthogonal to $\hat{b}$: $r = a - (a \cdot \hat{b}) \hat{b}$
Proof: $r \cdot \hat{b} = (a - (a \cdot \hat{b}) \hat{b}) \cdot \hat{b} = a \cdot \hat{b} - (a \cdot \hat{b})\, \hat{b} \cdot \hat{b} = a \cdot \hat{b} - a \cdot \hat{b} = 0$
Let $A_{m \times n}$, $\hat{b}_{n \times 1}$, so $A$ contains one observation on each row. To orthogonalize the residual along the rows: $A = A - (A \hat{b}) \hat{b}^T$ (6.2). Here $(A \hat{b})$ gives the components of each row of $A$ along $\hat{b}$, $\hat{b}^T$ flips $\hat{b}$ for subtraction along rows of $A$, and the product gives scaled versions of $\hat{b}$ for subtraction along the rows of $A$.
Proof:
$(A - (A \hat{b}) \hat{b}^T)\, \hat{b} = A \hat{b} - (A \hat{b})(\hat{b}^T \hat{b}) = A \hat{b} - (A \hat{b}) \cdot 1 = 0$
Let $A_{m \times n}$, $\hat{b}_{m \times 1}$, so $A$ contains one observation in each column. To orthogonalize the residual along the columns: $A = A - \hat{b} \hat{b}^T A$.
Proof: Let $C = A^T$ (so $C$ has one observation on each row). Equation 6.2 gives:
$C = C - (C \hat{b}) \hat{b}^T$
Transposing both sides: $C^T = C^T - \hat{b} (C \hat{b})^T = C^T - \hat{b} \hat{b}^T C^T$, i.e., $A = A - \hat{b} \hat{b}^T A$
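A small numpy check of both forms of the orthogonalization, using an arbitrary random $A$ and unit direction $\hat{b}$:

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))

# Rows are observations: remove components along b_hat (n x 1) from each row
b = rng.standard_normal(4)
b_hat = b / np.linalg.norm(b)
A_rows = A - np.outer(A @ b_hat, b_hat)          # A - (A b_hat) b_hat^T, eq. 6.2
assert np.allclose(A_rows @ b_hat, 0)            # every row is orthogonal to b_hat

# Columns are observations: remove components along b_hat (m x 1) from each column
b = rng.standard_normal(6)
b_hat = b / np.linalg.norm(b)
A_cols = A - np.outer(b_hat, b_hat) @ A          # A - b_hat b_hat^T A
assert np.allclose(b_hat @ A_cols, 0)            # every column is orthogonal to b_hat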

PLS Regression

For PCR, the $Y$ values are an afterthought: first explain the structure of $X$ using principal components, then regress from the projected $X$ to $Y$. PLS considers $X$ and $Y$ simultaneously by extracting their shared structure. It is assumed that $X_{m \times n}$ and $Y_{m \times d}$ are already column centered.
$X = T P^T + X_1$ (7.1)
$Y = T Q^T + Y_1$ (7.2)
$X_1$ and $Y_1$ are the residuals, $T$ holds the "latent components" and $P$ and $Q$ are the "loadings". Let $x \Leftarrow y$ stand for $x = \frac{y}{|y|}$, an operator that extracts a unit vector.

7.1 NIPALS (Iterative PLS)

Let $X_0 = X$, $Y_0 = Y$. $w$ is a unit vector along which $X$ is projected, $c$ is a unit vector along which $Y$ is projected, $t$ is the unit vector of the projected $X$, and $u$ is the projected $Y$. Initialize $u$ to the first column of $Y_{i-1}$ (or set $w$ to a random unit vector in step 1 below). The goal is to maximize $\operatorname{cov}(t, u)$ for each extracted latent component. Repeat to convergence to extract the $i$th latent component:
  1. $w \Leftarrow X_{i-1}^T u$
  2. $t \Leftarrow X_{i-1} w$
  3. $c \Leftarrow Y_{i-1}^T t$
  4. $u = Y_{i-1} c$
Let $p = X_{i-1}^T t$, $q = Y_{i-1}^T u$, $b = u^T t$.
Deflate: $X_i = X_{i-1} - t t^T X_{i-1} = X_{i-1} - t p^T$, $Y_i = Y_{i-1} - b t t^T Y_{i-1} = Y_{i-1} - b t c^T$. So the residual is formed by removing everything in the direction of $t$ from the columns of $X_{i-1}$. Deflation of $Y_i$ is similar, except the $t$ factors are scaled up by the dot product of $t$ and $u$ before removal. Deflation assures that $T^T T = I$. Proof: $t^T X_i = t^T (X_{i-1} - t t^T X_{i-1}) = t^T X_{i-1} - t^T t\, t^T X_{i-1} = t^T X_{i-1} - (t^T t)\, t^T X_{i-1} = t^T X_{i-1} - t^T X_{i-1} = 0$ since $t^T t = 1$.
$T = [\,t_1 \mid t_2 \mid \cdots \mid t_k\,]$
$P = [\,p_1 \mid p_2 \mid \cdots \mid p_k\,]$
$Q = [\,q_1 \mid q_2 \mid \cdots \mid q_k\,]$
$C = [\,c_1 \mid c_2 \mid \cdots \mid c_k\,]$
$D = \operatorname{diag}(b_1, b_2, \ldots, b_k)_{k \times k}$
After extracting $k$ latent components, when the residual $X_i$ is small, 7.1 gives:
$\hat{X} = X = T P^T$
$XP = T P^T P$
$XP (P^T P)^{-1} = T$
$B = P (P^T P)^{-1} D C^T$
$\hat{Y} = T D C^T = XB$ (7.3)
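A minimal sketch of the NIPALS loop above, assuming numpy; the function name, the convergence test on $t$, the iteration cap, and the tolerance are illustrative choices rather than part of the algorithm as stated:

import numpy as np

def nipals_pls(X, Y, n_components, max_iter=500, tol=1e-10):
    # X (m x n), Y (m x d): both assumed column centered.
    m, n = X.shape
    T, P, C, bs = [], [], [], []
    Xi, Yi = X.copy(), Y.copy()
    for _ in range(n_components):
        u = Yi[:, 0]                           # initialize u with the first column of Y_{i-1}
        t_old = np.zeros(m)
        for _ in range(max_iter):
            w = Xi.T @ u
            w = w / np.linalg.norm(w)          # step 1: unit vector along X_{i-1}^T u
            t = Xi @ w
            t = t / np.linalg.norm(t)          # step 2: unit vector along X_{i-1} w
            c = Yi.T @ t
            c = c / np.linalg.norm(c)          # step 3: unit vector along Y_{i-1}^T t
            u = Yi @ c                         # step 4: projected Y
            if np.linalg.norm(t - t_old) < tol:
                break
            t_old = t
        p, b = Xi.T @ t, u @ t
        Xi = Xi - np.outer(t, p)               # deflate X: X_i = X_{i-1} - t p^T
        Yi = Yi - b * np.outer(t, c)           # deflate Y: Y_i = Y_{i-1} - b t c^T
        T.append(t); P.append(p); C.append(c); bs.append(b)
    T, P, C = map(np.column_stack, (T, P, C))
    D = np.diag(bs)
    return P @ np.linalg.inv(P.T @ P) @ D @ C.T    # B from eq. 7.3: Y_hat = X B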

7.2 NIPALS: Theory of Operation

Start in step 4 and assume we already know $c$. The algorithm needs to maximize $\operatorname{cov}(t, u) \propto t \cdot u = t^T u \propto w^T X_{i-1}^T u$. This is maximized when $w$ is the unit vector along $X_{i-1}^T u$, justifying step 1. Now assume $w$ is known. The algorithm needs to maximize $\operatorname{cov}(u, t) \propto u \cdot t = u^T t = c^T Y_{i-1}^T t$. This is maximized when $c$ is the unit vector along $Y_{i-1}^T t$, justifying step 3.
Starting in step 1 and working backwards: $w \propto X_{i-1}^T u \propto X_{i-1}^T Y_{i-1} c \propto X_{i-1}^T Y_{i-1} Y_{i-1}^T t \propto X_{i-1}^T Y_{i-1} Y_{i-1}^T X_{i-1} w$. This is called power iteration. Starting with a random unit vector $w$ and iterating will, on convergence, result in:
$w = $ dominant eigen vector of $X_{i-1}^T Y_{i-1} Y_{i-1}^T X_{i-1}$ (7.4)
$c = $ dominant eigen vector of $Y_{i-1}^T X_{i-1} X_{i-1}^T Y_{i-1}$ (7.5)
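A quick numerical check of equation 7.4 for the first latent component (no deflation yet), assuming numpy; the dominant eigen vector of $X^T Y Y^T X$ is the leading left singular vector of $X^T Y$:

import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((40, 5)); X -= X.mean(axis=0)
Y = rng.standard_normal((40, 3)); Y -= Y.mean(axis=0)

# Power iteration on w, as implied by steps 1-4
w = rng.standard_normal(5)
for _ in range(1000):
    w = X.T @ Y @ Y.T @ X @ w
    w = w / np.linalg.norm(w)

# Eq. 7.4: w should align with the dominant eigen vector of X^T Y Y^T X,
# i.e. the leading left singular vector of S = X^T Y (up to sign)
U1, _, _ = np.linalg.svd(X.T @ Y)
assert np.abs(w @ U1[:, 0]) > 0.99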
Proving properties of this algorithm is difficult (but possible) because after extracting the first latent component the algorithm operates on residuals obtained through deflation, which obfuscates the connection to the original $X$ and $Y$. Knowing that the iteration converges on dominant eigen vectors, we can formulate an SVD based version named SIMPLS. It is identical to NIPALS for the first latent factor, but diverges for later factors because of a different deflation scheme. Once SIMPLS is understood it can also be implemented using power iteration instead of SVD if necessary. Other advantages of SIMPLS: a) The pseudo-inverse $P (P^T P)^{-1}$ need not be calculated. b) For large data sets repeated deflation of $X$ and $Y$ can be problematic; in SIMPLS only the covariance $X^T Y$, which is smaller, is deflated. c) It is convenient to find the objective function, since all relevant vectors can be directly expressed in terms of $X$ and $Y$, unlike NIPALS which uses the residuals $X_i$ and $Y_i$.

SIMPLS (Binu's Version)

import numpy as np

def simpls(X, Y, rcond):
    # Inputs: X (m x n), Y (m x d), both column centered.
    # rcond = parameter to decide when to stop extracting latent components
    m, n = X.shape
    R, V, T, Q = [], [], [], []
    S = X.T @ Y                              # covariance matrix, n x d
    k = n                                    # number of columns of X
    for i in range(k):
        U1, D1, VT1 = np.linalg.svd(S)
        # Columns of U1 = eigen vectors of S S^T
        # Rows of VT1 = eigen vectors of S^T S
        w = U1[:, 0]                         # first column of U1
        c = VT1[0, :]                        # first row of VT1
        lam = D1[0] ** 2                     # largest singular value squared = eigen value of w, c
        # Use the first eigen value to set a threshold. When we fall below
        # the threshold, stop extracting latent components.
        if i == 0:
            theta = lam * rcond
        elif lam < theta:
            k = i                            # number of latent components actually extracted
            break
        x = X @ w                            # project X
        t = x / np.linalg.norm(x)            # keep t as a unit vector
        r = w / np.linalg.norm(x)            # re-scale w to ensure XR = T
        v = p = X.T @ t
        q = Y.T @ t
        if i > 0:
            Vmat = np.column_stack(V)
            v = v - Vmat @ (Vmat.T @ v)      # deflate v wrt previous v vectors in V
        v = v / np.linalg.norm(v)            # convert v to a unit vector
        S = S - np.outer(v, v) @ S           # deflate the covariance matrix
        R.append(r); V.append(v); T.append(t); Q.append(q)   # append as columns of R, V, T, Q
        # Optional: append p to columns of P if we need to reconstruct X_hat
    B = np.column_stack(R) @ np.column_stack(Q).T   # regression matrix. Y_hat = X B
    return B
As with NIPALS:
$X = T P^T + X_1$ (8.1)
$Y = T Q^T + Y_1$ (8.2)
$\hat{X} = T P^T$ (8.3)
$\hat{Y} = T Q^T$ (8.4)
By construction the following are ensured:
$XR = T$ (8.5)
$T^T T = I$ (8.6)
Substituting 8.5 into 8.4 gives:
$\hat{Y} = XB = XR Q^T$ (8.7)
$B = R Q^T$ (8.8)
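A usage sketch for the simpls routine above on synthetic, noiseless data; with $Y$ constructed exactly as a linear function of a centered $X$, the fit should reproduce $Y$ essentially exactly:

import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 6)); X -= X.mean(axis=0)
B_true = rng.standard_normal((6, 2))
Y = X @ B_true                              # noiseless linear Y (already column centered)

B = simpls(X, Y, rcond=1e-12)
Y_hat = X @ B                               # eq. 8.7
assert B.shape == (6, 2)
assert np.linalg.norm(Y - Y_hat) < 1e-4 * np.linalg.norm(Y)   # essentially exact fit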

8.1 SIMPLS: Theory of Operation

SIMPLS works by deflating the covariance matrix instead of using residuals $X_i$, $Y_i$ as in NIPALS. Let $S_i$, $T_i$, $V_i$, $R_i$, $Q_i$ represent the values of the respective matrices on entry to loop iteration $i$. On exit from the loop these are updated for the next iteration. The vectors $w_i$, $t_i$, $v_i$, $r_i$, $q_i$ are computed during loop iteration $i$.
Let $S_i$ = residual covariance matrix, $S_0 = X^T Y$
Let $U1, D1, VT1 = \operatorname{svd}(S_i)$. Then the columns of $U1$ are the eigen vectors of $S S^T = X^T Y Y^T X$, and the rows of $VT1$ are the eigen vectors of $S^T S = Y^T X X^T Y$.
Two interpretations of w,c are possible.
  1. As with NIPALS, on the first iteration $w$ and $c$ are eigen vectors of $X^T Y Y^T X$ and $Y^T X X^T Y$ respectively. This does not hold after the first extraction.
  2. $\sqrt{\lambda}\, w c^T$ is the largest component / rank 1 approximation of the residual covariance. So the algorithm is attempting to deconstruct the covariance matrix. We cannot simply subtract $\sqrt{\lambda}\, w c^T$ from the covariance, though, because there is some regression coefficient that goes from the projected $X$ to the projected $Y$. Subtracting out $\sqrt{\lambda}\, w c^T$ will cause $T^T T \neq I$. So we need to further deconstruct this and also decide on a deflation strategy for $S$.
By constraint equation 8.6, $t \propto x = Xw$, so we ensure $t^T t = 1$ by doing $t = \frac{x}{|x|}$. To satisfy constraint equation 8.5, we do $r = \frac{w}{|x|}$ so that $Xr = X \frac{w}{|x|} = \frac{Xw}{|x|} = t$. When the residuals are $0$, $X = \hat{X}$ and $Y = \hat{Y}$. Equations 8.3, 8.4 and 8.6 give:
$X^T = (T P^T)^T = P T^T$
$X^T T = P T^T T = P$
$P = X^T T$
$Y^T = (T Q^T)^T = Q T^T$
$Y^T T = Q T^T T = Q$
$Q = Y^T T$
So the columns of $P$, $Q$ should consist of $p_i = X^T t_i$ and $q_i = Y^T t_i$ respectively. The residual vanishes only when $X^T = P T^T = X^T T T^T$ and $Y^T = Q T^T = Y^T T T^T$, which happens when $T$ has accumulated enough components to become square with $T^T T = I = T T^T$. Until then only the constraint $T^T T = I$ is ensured by construction of $T$. Having extracted one latent component, we need to deflate $S$ by some vector and then we can move on to the next iteration.

8.2 Choice of Deflation Vector

Let $v_i$ be some unit vector used to deflate the columns of $S$. We will derive the right choice of $v_i$ to ensure $T^T T = I$.
Let the columns of $V$ be the past deflation vectors. The most general scheme is to first deflate $v_i$ with respect to the prior vectors to ensure $V^T v_i = 0$, then use the deflated version of $v_i$ to deflate $S$.
Let $V_i^T V_i = I$ by construction.
$\gamma_i = v_i - V_i V_i^T v_i$
$V_i^T \gamma_i = V_i^T (v_i - V_i V_i^T v_i) = V_i^T v_i - V_i^T V_i V_i^T v_i = 0$
Append the unit vector of $\gamma_i$ to generate $V_{i+1}$:
$V_{i+1} = [\,V_i \mid \hat{\gamma}_i\,]$
This ensures that $V_{i+1}^T V_{i+1} = I$ since $V_i^T \hat{\gamma}_i = 0$, $V_i^T V_i = I$, $\hat{\gamma}_i^T \hat{\gamma}_i = 1$. The algorithm incrementally deflates $S$, but because the columns of $V$ are orthogonal to each other we can also do this directly from the original $S = S_0$.
$S_{i+1} = S - V_{i+1} V_{i+1}^T S = (I_n - V_{i+1} V_{i+1}^T) S$
Because $w_{i+1}$ is an eigen vector of $S_{i+1} S_{i+1}^T$:
$\lambda_{i+1} w_{i+1} = S_{i+1} S_{i+1}^T w_{i+1}$
$r_{i+1}$ is a scaled version of $w_{i+1}$, so for some scalar $a$:
$r_{i+1} = a\, S_{i+1} S_{i+1}^T r_{i+1}$
$t_{i+1} \propto X r_{i+1} = a X S_{i+1} S_{i+1}^T r_{i+1}$
$T_{i+1}^T t_{i+1} \propto T_{i+1}^T a X S_{i+1} S_{i+1}^T r_{i+1} = a\, T_{i+1}^T X (I_n - V_{i+1} V_{i+1}^T) S S_{i+1}^T r_{i+1}$
$T_{i+1}^T t_{i+1} \propto a\, (T_{i+1}^T X - T_{i+1}^T X V_{i+1} V_{i+1}^T) S S_{i+1}^T r_{i+1}$ (8.9)
To ensure that constraint equation $T^T T = I$ is satisfied we need $T_{i+1}^T t_{i+1} = 0$, so that each latent component is orthogonal to the previous components. If it were the case that $X^T T_{i+1} = V_{i+1}$, then equation 8.9 becomes:
$T_{i+1}^T t_{i+1} \propto a\, (V_{i+1}^T - V_{i+1}^T V_{i+1} V_{i+1}^T) S S_{i+1}^T r_{i+1} = a\, (V_{i+1}^T - I\, V_{i+1}^T) S S_{i+1}^T r_{i+1} = 0$
So the choice $X^T T_{i+1} = V_{i+1}$, or equivalently $X^T T_i = V_i$, ensures the required constraint. Therefore the right choice of deflation vector is: $v_i = X^T t_i$.
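A short numerical check of this conclusion, assuming numpy: run the SIMPLS loop body by hand for three components with deflation vector $v_i = X^T t_i$ (orthogonalized against the previous deflation vectors), then verify that $T^T T = I$ and that the columns of $X^T T$ lie in the span of $V$:

import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((60, 5)); X -= X.mean(axis=0)
Y = rng.standard_normal((60, 3)); Y -= Y.mean(axis=0)

S = X.T @ Y
T, V = [], []
for i in range(3):                       # extract three latent components manually
    U1, D1, VT1 = np.linalg.svd(S)
    w = U1[:, 0]
    t = X @ w
    t = t / np.linalg.norm(t)
    v = X.T @ t                          # deflation vector v_i = X^T t_i
    for prev in V:
        v = v - prev * (prev @ v)        # orthogonalize against earlier deflation vectors
    v = v / np.linalg.norm(v)
    S = S - np.outer(v, v) @ S           # deflate the covariance
    T.append(t); V.append(v)

T, V = np.column_stack(T), np.column_stack(V)
assert np.allclose(T.T @ T, np.eye(3), atol=1e-8)                   # T^T T = I, eq. 8.6
assert np.allclose(X.T @ T - V @ (V.T @ (X.T @ T)), 0, atol=1e-8)   # columns of X^T T lie in span(V)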