Linear Algebra: Notes on the Singular Value Decomposition (SVD), Principal Component Analysis (PCA), Partial Least Squares Regression (PLS) and SIMPLS

SVD

Let $X_{m \times n}$, $V = [\,v_1 \mid v_2 \mid \cdots \mid v_r\,]_{n \times r}$, $|v_i| = 1$
Let $w_i = X v_i$, $\sigma_i = |w_i|$
Let $u_i = \hat{w}_i = \frac{w_i}{|w_i|} = \frac{w_i}{\sigma_i}$
Let $U = [\,u_1 \mid u_2 \mid \cdots \mid u_r\,]_{m \times r}$, $D = \operatorname{diag}(\sigma_1, \sigma_2, \ldots, \sigma_r)_{r \times r}$
By convention, but not by mathematical necessity, $\sigma_i \geq \sigma_j$ for $i < j$
$X v_i = w_i = u_i \sigma_i$
$XV = UD$ (1.1)
With $V_{n \times r} = \operatorname{Eig}(X^T X)$ when $X^T X$ is of rank $r$:
$X^T X v_i = \lambda_i v_i$
$v_i \cdot v_j = \delta_{ij}$
$V^T V = I_{r \times r}$
See Section 6.1: $V V^T = I_{n \times n}$ (1.2)
$w_i \cdot w_i = w_i^T w_i = (X v_i)^T (X v_i) = v_i^T (X^T X v_i) = \lambda_i v_i^T v_i = \lambda_i$
$\sigma_i = |w_i| = \sqrt{\lambda_i}$
$u_i = \hat{w}_i = \frac{w_i}{|w_i|} = \frac{X v_i}{\sqrt{\lambda_i}}$
$u_i \cdot u_j = u_i^T u_j = \frac{1}{\sqrt{\lambda_i \lambda_j}} (X v_i)^T (X v_j) = \frac{1}{\sqrt{\lambda_i \lambda_j}} v_i^T (X^T X v_j) = \frac{1}{\sqrt{\lambda_i \lambda_j}} v_i^T \lambda_j v_j = \sqrt{\frac{\lambda_j}{\lambda_i}}\, v_i^T v_j = \delta_{ij}$
$U^T U = I_{r \times r}$
See Section 6.1: $U U^T = I_{m \times m}$
Equations 1.1, 1.2 give: $XV = UD$
$XV V^T = X I_{n \times n} = UD V^T$
The SVD is: $X = UD V^T$ (1.3)
Columns of $U$ are the left singular vectors, columns of $V$ (rows of $V^T$) are the right singular vectors, and $\operatorname{diag}(D)$ are the singular values of $X$. The condition number of $X$ is $\kappa(X) = \frac{\sigma_1}{\sigma_r}$.
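A minimal numpy sketch of this construction, assuming a random full-column-rank $X$ (so $r = n$); the seed and dimensions are arbitrary:

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((6, 4))            # m x n, assumed full column rank

lam, V = np.linalg.eigh(X.T @ X)           # eigen values of X^T X, ascending
lam, V = lam[::-1], V[:, ::-1]             # reorder so sigma_1 >= ... >= sigma_r
sigma = np.sqrt(lam)                       # sigma_i = sqrt(lambda_i)
U = (X @ V) / sigma                        # u_i = X v_i / sigma_i, broadcast over columns
D = np.diag(sigma)

assert np.allclose(X @ V, U @ D)           # eq. 1.1
assert np.allclose(X, U @ D @ V.T)         # eq. 1.3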

Generalized Inverse

Let $D^{-1} = \operatorname{diag}\!\left(\frac{1}{\sigma_1}, \frac{1}{\sigma_2}, \ldots, \frac{1}{\sigma_r}\right)_{r \times r}$
$D D^{-1} = D^{-1} D = I_{r \times r}$ (2.1)
Equations 1.3, 2.1 give:
$X = UD V^T$
$X (V D^{-1} U^T) = UD V^T V D^{-1} U^T = U D D^{-1} U^T = U U^T = I_{m \times m}$
$X^{-1}_{\text{right}} = V D^{-1} U^T$
$(V D^{-1} U^T) X = V D^{-1} U^T U D V^T = V D^{-1} D V^T = V V^T = I_{n \times n}$
$X^{-1}_{\text{left}} = V D^{-1} U^T$
$X^{-1} = V D^{-1} U^T$
When $D$ is of rank $r < \min(m, n)$, for computational purposes it is acceptable to extend $D$ to a larger square matrix, invert only the non-zero entries, and leave the zeros intact to get $D^{-1}$ (i.e., the pseudo-inverse of $D$).
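A short numpy sketch of the generalized inverse, assuming a tall full-rank $X$ so only the left inverse exists; the comparison with np.linalg.pinv is a sanity check:

import numpy as np

rng = np.random.default_rng(1)
X = rng.standard_normal((5, 3))                   # m x n with m > n: expect a left inverse

U, s, Vt = np.linalg.svd(X, full_matrices=False)  # economy SVD: U is m x r, Vt is r x n
D_inv = np.diag(1.0 / s)                          # invert only the non-zero singular values
X_pinv = Vt.T @ D_inv @ U.T                       # V D^{-1} U^T

assert np.allclose(X_pinv @ X, np.eye(3))         # left inverse: (V D^{-1} U^T) X = I_n
assert np.allclose(X_pinv, np.linalg.pinv(X))     # matches numpy's pseudo-inverse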

Low Rank Approximation

Let $\hat{D}_{ij} = D_{ij}\, \mathbb{1}[D_{ij} \geq \tau]$, i.e., singular values less than a threshold $\tau$ are set to 0. Then the low rank approximation of $X$ is given by:
$\hat{X} = \hat{U} \hat{D} \hat{V}^T$
For computational efficiency, $\hat{D}$ may be replaced with a smaller $k \times k$ version when there are $k$ remaining singular values; only the first $k$ columns of $\hat{U}$ and the first $k$ rows of $\hat{V}^T$ need to be retained. This can also be used for compression and denoising.
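A numpy sketch of the thresholded low rank approximation; the threshold $\tau$ here is an arbitrary illustrative value:

import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((8, 6))

U, s, Vt = np.linalg.svd(X, full_matrices=False)
tau = 0.5 * s[0]                     # illustrative threshold on the singular values
k = int(np.sum(s >= tau))            # number of retained singular values

# Keep the first k columns of U, first k singular values, first k rows of V^T
X_hat = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]

# Spectral norm of the error equals the first discarded singular value
print(k, np.linalg.norm(X - X_hat, 2), s[k] if k < len(s) else 0.0)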

Diagonalizing Covariance

Let $X$ represent data where each row is an observation and each column is a variable. Let $X$ be in column centered form, i.e., the mean of each column has been subtracted from that column yielding $E[X_{:,j}] = 0$. Let $Y = XV$ represent a projected version of $X$.
$XV = (UD V^T) V = UD$
$(m-1) \operatorname{cov}(Y) = Y^T Y = (UD)^T UD = D^T U^T U D = DD$ (4.1)
$Y = XV$ diagonalizes the covariance when observations are along rows and the data is column centered.
Let $X$ represent data where each column is an observation and each row is a variable. Let $X$ be in row centered form, i.e., the mean of each row has been subtracted from that row yielding $E[X_{i,:}] = 0$. Let $Y = U^T X$ represent a projected version of $X$.
$U^T X = U^T (UD V^T) = D V^T$
$(n-1) \operatorname{cov}(Y) = Y Y^T = D V^T (D V^T)^T = D V^T V D^T = DD$ (4.2)
$Y = U^T X$ diagonalizes the covariance when observations are along columns and the data is row centered.
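A quick numpy check of equation 4.1 with column centered data (the row centered case of 4.2 is analogous); the dimensions and seed are arbitrary:

import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((50, 4))
X = X - X.mean(axis=0)                 # column centered: rows are observations

U, s, Vt = np.linalg.svd(X, full_matrices=False)
Y = X @ Vt.T                           # projected data, Y = XV

C = Y.T @ Y                            # (m-1) cov(Y), eq. 4.1
assert np.allclose(C, np.diag(s ** 2)) # DD: diagonal with sigma_i^2 on the diagonal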

PCA

Let $X$ be row centered data, i.e., columns of $X$ are observations. Find a projection $Y = PX$ that "best" expresses the data. Best means:
  1. Assume that $\mathrm{SNR} = \sigma^2_{\text{signal}} / \sigma^2_{\text{noise}}$ has to be maximized. Further assume that the largest variance is because of the signal and that the noise is uncorrelated with the signal.
  2. Reduce redundancy. If the projection converts the data to a non-redundant form then each projected variable is independent of the other projected variables, so the covariance matrix of $Y$ will be diagonal.
Equation 4.2 shows that $P = U^T$ is the required projection matrix. Alternate derivation:
$(n-1) \operatorname{cov}(Y) = Y Y^T = PX (PX)^T = PX X^T P^T$ (5.1)
Let $E = \operatorname{Eig}(X X^T)$ with eigen values in a diagonal matrix $\lambda$
$X X^T E = E \lambda$
$X X^T E E^T = X X^T = E \lambda E^T$ (5.2)
So the covariance matrix can be factored into a product with a diagonal matrix at the center.
With $P = E^T$, equations 5.1, 5.2 give: $PX X^T P^T = E^T E \lambda E^T E = \lambda$
So this choice of $P$ diagonalizes the covariance matrix. The rows of $P$ (columns of $E$) are the principal components of $X$. By comparison with equation 4.2, $E = \operatorname{Eig}(X X^T) = U$. Similarly, when the data is column centered and the observations are along the rows, right multiplication by $V$ is the appropriate projection according to equation 4.1.
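A numpy sketch of PCA as described above, assuming row centered data with observations in the columns; it also confirms that the eigen vectors of $X X^T$ match $U$ from the SVD up to sign:

import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 200))          # variables x observations
X = X - X.mean(axis=1, keepdims=True)      # row centered

U, s, Vt = np.linalg.svd(X, full_matrices=False)
P = U.T                                    # rows of P are the principal components
Y = P @ X                                  # projected data

C = Y @ Y.T                                # (n-1) cov(Y), eq. 4.2 / 5.1
assert np.allclose(C, np.diag(s ** 2))     # covariance is diagonalized

# Same components (up to sign) from the eigen decomposition of X X^T
lam, E = np.linalg.eigh(X @ X.T)
E = E[:, ::-1]                             # descending eigen value order
assert np.allclose(np.abs(E), np.abs(U))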

Linear Algebra Preliminaries

6.1 Column Orthonormal Matrix

If $V_{n \times r}$ is column orthonormal, $V^T V = I_{r \times r}$.
$\operatorname{rank}(I_{r \times r}) = r$
$\operatorname{rank}(V^T V) = \operatorname{rank}(V V^T) = \operatorname{rank}(V) \leq \min(n, r)$
Since $\operatorname{rank}(V^T V) = r$, therefore $n \geq r$; when $n < r$, $V$ cannot be column orthonormal. Since $V V^T$ is $n \times n$ and $\operatorname{rank}(I_{n \times n}) = n$, $V V^T = I_{n \times n}$ can be true only when $n = r$. Thus $V^T V = V V^T = I$ only when $V$ is column orthonormal and square. Otherwise it can be made square by padding on $(n - r)$ additional orthonormal columns.

6.2 Properties of Inverse

$A A^{-1} = I$
$(A A^{-1})^T = (A^{-1})^T A^T = I^T = I$
$(A^T)^{-1} = (A^{-1})^T$ (6.1)
Let $A$ be symmetric, $A^T = A$. Equation 6.1 gives:
$(A^T)^{-1} = A^{-1} = (A^{-1})^T$
When $A$ is symmetric, $A^{-1}$, if it exists, is also symmetric.

6.3 Eigen Vectors of a Symmetric Matrix

Eigen vectors corresponding to distinct eigen values of a symmetric matrix are orthogonal.
Let $A v_i = \lambda_i v_i$, $A v_j = \lambda_j v_j$, $\lambda_i \neq \lambda_j$, $A = A^T$
$v_i^T A v_j = v_i^T \lambda_j v_j = \lambda_j v_i^T v_j$
$v_i^T A v_j = (A^T v_i)^T v_j = (A v_i)^T v_j = \lambda_i v_i^T v_j$
$\lambda_i v_i^T v_j = \lambda_j v_i^T v_j$
$(\lambda_i - \lambda_j) v_i^T v_j = 0$
Since $\lambda_i \neq \lambda_j$, $v_i^T v_j = 0$
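A quick numpy illustration, using an arbitrary random symmetric matrix:

import numpy as np

rng = np.random.default_rng(9)
M = rng.standard_normal((4, 4))
A = M + M.T                                  # a random symmetric matrix

lam, V = np.linalg.eigh(A)                   # eigen vectors in the columns of V
assert np.allclose(V.T @ V, np.eye(4))       # eigen vectors are orthonormal (eigen values distinct here)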

6.4 Subtracting out components to make residual orthogonal

To subtract all components of vector $a$ along the direction $\hat{b}$, leaving the residual orthogonal to $\hat{b}$: $r = a - (a \cdot \hat{b}) \hat{b}$
Proof: $r \cdot \hat{b} = (a - (a \cdot \hat{b}) \hat{b}) \cdot \hat{b} = a \cdot \hat{b} - (a \cdot \hat{b})\, \hat{b} \cdot \hat{b} = a \cdot \hat{b} - a \cdot \hat{b} = 0$
Let $A_{m \times n}$, $\hat{b}_{n \times 1}$, so $A$ contains one observation on each row. To orthogonalize the residual along the rows: $A = A - (A \hat{b}) \hat{b}^T$ (6.2). Here $(A \hat{b})$ gives the components of each row of $A$ along $\hat{b}$, $\hat{b}^T$ flips $\hat{b}$ for subtraction along rows of $A$, and the product gives scaled versions of $\hat{b}$ for subtraction along the rows of $A$.
Proof:
$(A - (A \hat{b}) \hat{b}^T)\, \hat{b} = A \hat{b} - (A \hat{b})(\hat{b}^T \hat{b}) = A \hat{b} - (A \hat{b}) \cdot 1 = 0$
Let $A_{m \times n}$, $\hat{b}_{m \times 1}$, so $A$ contains one observation in each column. To orthogonalize the residual along the columns: $A = A - \hat{b} \hat{b}^T A$.
Proof: Let $C = A^T$ (so $C$ has one observation on each row). Equation 6.2 gives:
$C = C - (C \hat{b}) \hat{b}^T$
Transposing both sides: $C^T = C^T - \hat{b} (C \hat{b})^T = C^T - \hat{b} \hat{b}^T C^T$, i.e., $A = A - \hat{b} \hat{b}^T A$
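A small numpy check of both forms of the orthogonalization, using an arbitrary random $A$ and unit direction $\hat{b}$:

import numpy as np

rng = np.random.default_rng(5)
A = rng.standard_normal((6, 4))

# Rows are observations: remove components along b_hat (n x 1) from each row
b = rng.standard_normal(4)
b_hat = b / np.linalg.norm(b)
A_rows = A - np.outer(A @ b_hat, b_hat)          # A - (A b_hat) b_hat^T, eq. 6.2
assert np.allclose(A_rows @ b_hat, 0)            # every row is orthogonal to b_hat

# Columns are observations: remove components along b_hat (m x 1) from each column
b = rng.standard_normal(6)
b_hat = b / np.linalg.norm(b)
A_cols = A - np.outer(b_hat, b_hat) @ A          # A - b_hat b_hat^T A
assert np.allclose(b_hat @ A_cols, 0)            # every column is orthogonal to b_hat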

PLS Regression

For PCR, the $Y$ values are an afterthought: first explain the structure of $X$ using principal components, then regress from the projected $X$ to $Y$. PLS considers $X$ and $Y$ simultaneously by extracting their shared structure. It is assumed that $X_{m \times n}$ and $Y_{m \times d}$ are already column centered.
$X = T P^T + X_1$ (7.1)
$Y = T Q^T + Y_1$ (7.2)
$X_1$ and $Y_1$ are the residuals, $T$ holds the "latent components" and $P$ and $Q$ are the "loadings". Let $x \Leftarrow y$ stand for $x = \frac{y}{|y|}$, an operator that extracts a unit vector.

7.1 NIPALS (Iterative PLS)

Let $X_0 = X$, $Y_0 = Y$. $w$ is a unit vector along which $X$ is projected, $c$ is a unit vector along which $Y$ is projected, $t$ is the unit vector of the projected $X$, and $u$ is the projected $Y$. Initialize $u$ to the first column of $Y_{i-1}$ (or set $w$ to a random unit vector in step 1 below). The goal is to maximize $\operatorname{cov}(t, u)$ for each extracted latent component. Repeat to convergence to extract the $i$th latent component:
  1. $w \Leftarrow X_{i-1}^T u$
  2. $t \Leftarrow X_{i-1} w$
  3. $c \Leftarrow Y_{i-1}^T t$
  4. $u = Y_{i-1} c$
Let $p = X_{i-1}^T t$, $q = Y_{i-1}^T u$, $b = u^T t$.
Deflate: $X_i = X_{i-1} - t t^T X_{i-1} = X_{i-1} - t p^T$, $Y_i = Y_{i-1} - b t t^T Y_{i-1} = Y_{i-1} - b t c^T$. So the residual is formed by removing everything in the direction of $t$ from the columns of $X_{i-1}$. Deflation of $Y_i$ is similar, except the $t$ factors are scaled up by the dot product of $t$ and $u$ before removal. Deflation assures that $T^T T = I$. Proof: $t^T X_i = t^T (X_{i-1} - t t^T X_{i-1}) = t^T X_{i-1} - t^T t\, t^T X_{i-1} = t^T X_{i-1} - (t^T t)\, t^T X_{i-1} = t^T X_{i-1} - t^T X_{i-1} = 0$ since $t^T t = 1$.
$T = [\,t_1 \mid t_2 \mid \cdots \mid t_k\,]$
$P = [\,p_1 \mid p_2 \mid \cdots \mid p_k\,]$
$Q = [\,q_1 \mid q_2 \mid \cdots \mid q_k\,]$
$C = [\,c_1 \mid c_2 \mid \cdots \mid c_k\,]$
$D = \operatorname{diag}(b_1, b_2, \ldots, b_k)_{k \times k}$
After extracting $k$ latent components, when the residual $X_i$ is small, 7.1 gives:
$\hat{X} = X = T P^T$
$XP = T P^T P$
$XP (P^T P)^{-1} = T$
$B = P (P^T P)^{-1} D C^T$
$\hat{Y} = T D C^T = XB$ (7.3)
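A minimal sketch of the NIPALS loop above, assuming numpy; the function name, the convergence test on $t$, the iteration cap, and the tolerance are illustrative choices rather than part of the algorithm as stated:

import numpy as np

def nipals_pls(X, Y, n_components, max_iter=500, tol=1e-10):
    # X (m x n), Y (m x d): both assumed column centered.
    m, n = X.shape
    T, P, C, bs = [], [], [], []
    Xi, Yi = X.copy(), Y.copy()
    for _ in range(n_components):
        u = Yi[:, 0]                           # initialize u with the first column of Y_{i-1}
        t_old = np.zeros(m)
        for _ in range(max_iter):
            w = Xi.T @ u
            w = w / np.linalg.norm(w)          # step 1: unit vector along X_{i-1}^T u
            t = Xi @ w
            t = t / np.linalg.norm(t)          # step 2: unit vector along X_{i-1} w
            c = Yi.T @ t
            c = c / np.linalg.norm(c)          # step 3: unit vector along Y_{i-1}^T t
            u = Yi @ c                         # step 4: projected Y
            if np.linalg.norm(t - t_old) < tol:
                break
            t_old = t
        p, b = Xi.T @ t, u @ t
        Xi = Xi - np.outer(t, p)               # deflate X: X_i = X_{i-1} - t p^T
        Yi = Yi - b * np.outer(t, c)           # deflate Y: Y_i = Y_{i-1} - b t c^T
        T.append(t); P.append(p); C.append(c); bs.append(b)
    T, P, C = map(np.column_stack, (T, P, C))
    D = np.diag(bs)
    return P @ np.linalg.inv(P.T @ P) @ D @ C.T    # B from eq. 7.3: Y_hat = X B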

7.2 NIPALS: Theory of Operation

Start in step 4 and assume we already know $c$. The algorithm needs to maximize $\operatorname{cov}(t, u) \propto t \cdot u = t^T u \propto w^T X_{i-1}^T u$. This is maximized when $w$ is the unit vector along $X_{i-1}^T u$, justifying step 1. Now assume $w$ is known. The algorithm needs to maximize $\operatorname{cov}(u, t) \propto u \cdot t = u^T t = c^T Y_{i-1}^T t$. This is maximized when $c$ is the unit vector along $Y_{i-1}^T t$, justifying step 3.
Starting in step 1 and working backwards: $w \propto X_{i-1}^T u \propto X_{i-1}^T Y_{i-1} c \propto X_{i-1}^T Y_{i-1} Y_{i-1}^T t \propto X_{i-1}^T Y_{i-1} Y_{i-1}^T X_{i-1} w$. This is called power iteration. Starting with a random unit vector $w$ and iterating will, on convergence, result in:
$w = $ dominant eigen vector of $X_{i-1}^T Y_{i-1} Y_{i-1}^T X_{i-1}$ (7.4)
$c = $ dominant eigen vector of $Y_{i-1}^T X_{i-1} X_{i-1}^T Y_{i-1}$ (7.5)
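A quick numerical check of equation 7.4 for the first latent component (no deflation yet), assuming numpy; the dominant eigen vector of $X^T Y Y^T X$ is the leading left singular vector of $X^T Y$:

import numpy as np

rng = np.random.default_rng(6)
X = rng.standard_normal((40, 5)); X -= X.mean(axis=0)
Y = rng.standard_normal((40, 3)); Y -= Y.mean(axis=0)

# Power iteration on w, as implied by steps 1-4
w = rng.standard_normal(5)
for _ in range(1000):
    w = X.T @ Y @ Y.T @ X @ w
    w = w / np.linalg.norm(w)

# Eq. 7.4: w should align with the dominant eigen vector of X^T Y Y^T X,
# i.e. the leading left singular vector of S = X^T Y (up to sign)
U1, _, _ = np.linalg.svd(X.T @ Y)
assert np.abs(w @ U1[:, 0]) > 0.99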
Proving properties of this algorithm is difficult (but possible) because after extracting the first latent component the algorithm operates on residuals obtained through deflation, which obfuscates the connection to the original $X$ and $Y$. Knowing that the iteration converges on dominant eigen vectors, we can formulate an SVD based version named SIMPLS. It is identical to NIPALS for the first latent factor, but diverges for later factors because of a different deflation scheme. Once SIMPLS is understood it can also be implemented using power iteration instead of SVD if necessary. Other advantages of SIMPLS: a) The pseudo-inverse $P (P^T P)^{-1}$ need not be calculated. b) For large data sets repeated deflation of $X$ and $Y$ can be problematic; in SIMPLS only the covariance $X^T Y$, which is smaller, is deflated. c) It is convenient to find the objective function, since all relevant vectors can be directly expressed in terms of $X$ and $Y$, unlike NIPALS which uses the residuals $X_i$ and $Y_i$.

SIMPLS (Binu's Version)

import numpy as np

def simpls(X, Y, rcond):
    # Inputs: X (m x n), Y (m x d), both column centered.
    # rcond = parameter to decide when to stop extracting latent components
    m, n = X.shape
    R, V, T, Q = [], [], [], []
    S = X.T @ Y                              # covariance matrix, n x d
    k = n                                    # number of columns of X
    for i in range(k):
        U1, D1, VT1 = np.linalg.svd(S)
        # Columns of U1 = eigen vectors of S S^T
        # Rows of VT1 = eigen vectors of S^T S
        w = U1[:, 0]                         # first column of U1
        c = VT1[0, :]                        # first row of VT1
        lam = D1[0] ** 2                     # largest singular value squared = eigen value of w, c
        # Use the first eigen value to set a threshold. When we fall below
        # the threshold, stop extracting latent components.
        if i == 0:
            theta = lam * rcond
        elif lam < theta:
            k = i                            # number of latent components actually extracted
            break
        x = X @ w                            # project X
        t = x / np.linalg.norm(x)            # keep t as a unit vector
        r = w / np.linalg.norm(x)            # re-scale w to ensure XR = T
        v = p = X.T @ t
        q = Y.T @ t
        if i > 0:
            Vmat = np.column_stack(V)
            v = v - Vmat @ (Vmat.T @ v)      # deflate v wrt previous v vectors in V
        v = v / np.linalg.norm(v)            # convert v to a unit vector
        S = S - np.outer(v, v) @ S           # deflate the covariance matrix
        R.append(r); V.append(v); T.append(t); Q.append(q)   # append as columns of R, V, T, Q
        # Optional: append p to columns of P if we need to reconstruct X_hat
    B = np.column_stack(R) @ np.column_stack(Q).T   # regression matrix. Y_hat = X B
    return B
As with NIPALS:
$X = T P^T + X_1$ (8.1)
$Y = T Q^T + Y_1$ (8.2)
$\hat{X} = T P^T$ (8.3)
$\hat{Y} = T Q^T$ (8.4)
By construction the following are ensured:
$XR = T$ (8.5)
$T^T T = I$ (8.6)
Substituting 8.5 into 8.4 gives:
$\hat{Y} = XB = XR Q^T$ (8.7)
$B = R Q^T$ (8.8)
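A usage sketch for the simpls routine above on synthetic, noiseless data; with $Y$ constructed exactly as a linear function of a centered $X$, the fit should reproduce $Y$ essentially exactly:

import numpy as np

rng = np.random.default_rng(7)
X = rng.standard_normal((100, 6)); X -= X.mean(axis=0)
B_true = rng.standard_normal((6, 2))
Y = X @ B_true                              # noiseless linear Y (already column centered)

B = simpls(X, Y, rcond=1e-12)
Y_hat = X @ B                               # eq. 8.7
assert B.shape == (6, 2)
assert np.linalg.norm(Y - Y_hat) < 1e-4 * np.linalg.norm(Y)   # essentially exact fit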

8.1 SIMPLS: Theory of Operation

SIMPLS works by deflating the covariance matrix instead of using residuals $X_i$, $Y_i$ as in NIPALS. Let $S_i$, $T_i$, $V_i$, $R_i$, $Q_i$ represent the values of the respective matrices on entry to loop iteration $i$. On exit from the loop these are updated for the next iteration. The vectors $w_i$, $t_i$, $v_i$, $r_i$, $q_i$ are computed during loop iteration $i$.
Let $S_i$ = residual covariance matrix, $S_0 = X^T Y$
Let $U1, D1, VT1 = \operatorname{svd}(S_i)$. Then the columns of $U1$ are the eigen vectors of $S S^T = X^T Y Y^T X$, and the rows of $VT1$ are the eigen vectors of $S^T S = Y^T X X^T Y$.
Two interpretations of w,c are possible.
  1. As with NIPALS, on the first iteration $w$ and $c$ are eigen vectors of $X^T Y Y^T X$ and $Y^T X X^T Y$ respectively. This does not hold after the first extraction.
  2. $\sqrt{\lambda}\, w c^T$ is the largest component / rank 1 approximation of the residual covariance. So the algorithm is attempting to deconstruct the covariance matrix. We cannot simply subtract $\sqrt{\lambda}\, w c^T$ from the covariance, though, because there is some regression coefficient that goes from the projected $X$ to the projected $Y$. Subtracting out $\sqrt{\lambda}\, w c^T$ will cause $T^T T \neq I$. So we need to further deconstruct this and also decide on a deflation strategy for $S$.
By constraint equation 8.6, $t \propto x = Xw$, so we ensure $t^T t = 1$ by doing $t = \frac{x}{|x|}$. To satisfy constraint equation 8.5, we do $r = \frac{w}{|x|}$ so that $Xr = X \frac{w}{|x|} = \frac{Xw}{|x|} = t$. When the residuals are $0$, $X = \hat{X}$ and $Y = \hat{Y}$. Equations 8.3, 8.4 and 8.6 give:
$X^T = (T P^T)^T = P T^T$
$X^T T = P T^T T = P$
$P = X^T T$
$Y^T = (T Q^T)^T = Q T^T$
$Y^T T = Q T^T T = Q$
$Q = Y^T T$
So the columns of $P$, $Q$ should consist of $p_i = X^T t_i$ and $q_i = Y^T t_i$ respectively. The residual vanishes only when $X^T = P T^T = X^T T T^T$ and $Y^T = Q T^T = Y^T T T^T$, which happens when $T$ has accumulated enough components to become square with $T^T T = I = T T^T$. Until then only the constraint $T^T T = I$ is ensured by construction of $T$. Having extracted one latent component, we need to deflate $S$ by some vector and then we can move on to the next iteration.

8.2 Choice of Deflation Vector

Let $v_i$ be some unit vector used to deflate the columns of $S$. We will derive the right choice of $v_i$ to ensure $T^T T = I$.
Let the columns of $V$ be the past deflation vectors. The most general scheme is to first deflate $v_i$ with respect to the prior vectors to ensure $V^T v_i = 0$, then use the deflated version of $v_i$ to deflate $S$.
Let $V_i^T V_i = I$ by construction.
$\gamma_i = v_i - V_i V_i^T v_i$
$V_i^T \gamma_i = V_i^T (v_i - V_i V_i^T v_i) = V_i^T v_i - V_i^T V_i V_i^T v_i = 0$
Append the unit vector of $\gamma_i$ to generate $V_{i+1}$:
$V_{i+1} = [\,V_i \mid \hat{\gamma}_i\,]$
This ensures that $V_{i+1}^T V_{i+1} = I$ since $V_i^T \hat{\gamma}_i = 0$, $V_i^T V_i = I$, $\hat{\gamma}_i^T \hat{\gamma}_i = 1$. The algorithm incrementally deflates $S$, but because the columns of $V$ are orthogonal to each other we can also do this directly from the original $S = S_0$.
$S_{i+1} = S - V_{i+1} V_{i+1}^T S = (I_n - V_{i+1} V_{i+1}^T) S$
Because $w_{i+1}$ is an eigen vector of $S_{i+1} S_{i+1}^T$:
$\lambda_{i+1} w_{i+1} = S_{i+1} S_{i+1}^T w_{i+1}$
$r_{i+1}$ is a scaled version of $w_{i+1}$, so for some scalar $a$:
$r_{i+1} = a\, S_{i+1} S_{i+1}^T r_{i+1}$
$t_{i+1} \propto X r_{i+1} = a X S_{i+1} S_{i+1}^T r_{i+1}$
$T_{i+1}^T t_{i+1} \propto T_{i+1}^T a X S_{i+1} S_{i+1}^T r_{i+1} = a\, T_{i+1}^T X (I_n - V_{i+1} V_{i+1}^T) S S_{i+1}^T r_{i+1}$
$T_{i+1}^T t_{i+1} \propto a\, (T_{i+1}^T X - T_{i+1}^T X V_{i+1} V_{i+1}^T) S S_{i+1}^T r_{i+1}$ (8.9)
To ensure that constraint equation $T^T T = I$ is satisfied we need $T_{i+1}^T t_{i+1} = 0$, so that each latent component is orthogonal to the previous components. If it were the case that $X^T T_{i+1} = V_{i+1}$, then equation 8.9 becomes:
$T_{i+1}^T t_{i+1} \propto a\, (V_{i+1}^T - V_{i+1}^T V_{i+1} V_{i+1}^T) S S_{i+1}^T r_{i+1} = a\, (V_{i+1}^T - I\, V_{i+1}^T) S S_{i+1}^T r_{i+1} = 0$
So the choice $X^T T_{i+1} = V_{i+1}$, or equivalently $X^T T_i = V_i$, ensures the required constraint. Therefore the right choice of deflation vector is: $v_i = X^T t_i$.
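A short numerical check of this conclusion, assuming numpy: run the SIMPLS loop body by hand for three components with deflation vector $v_i = X^T t_i$ (orthogonalized against the previous deflation vectors), then verify that $T^T T = I$ and that the columns of $X^T T$ lie in the span of $V$:

import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((60, 5)); X -= X.mean(axis=0)
Y = rng.standard_normal((60, 3)); Y -= Y.mean(axis=0)

S = X.T @ Y
T, V = [], []
for i in range(3):                       # extract three latent components manually
    U1, D1, VT1 = np.linalg.svd(S)
    w = U1[:, 0]
    t = X @ w
    t = t / np.linalg.norm(t)
    v = X.T @ t                          # deflation vector v_i = X^T t_i
    for prev in V:
        v = v - prev * (prev @ v)        # orthogonalize against earlier deflation vectors
    v = v / np.linalg.norm(v)
    S = S - np.outer(v, v) @ S           # deflate the covariance
    T.append(t); V.append(v)

T, V = np.column_stack(T), np.column_stack(V)
assert np.allclose(T.T @ T, np.eye(3), atol=1e-8)                   # T^T T = I, eq. 8.6
assert np.allclose(X.T @ T - V @ (V.T @ (X.T @ T)), 0, atol=1e-8)   # columns of X^T T lie in span(V)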