Lecture 20 - Principal Component Analysis (Application)
The goal for this application is:
Given "centered" data $\vec{x}_1, \dots, \vec{x}_n \in \mathbb{R}^d$
Given $k$ where $1 \le k \le d$
Find the $k$-dimensional subspace that minimizes the average squared distance between the points $\vec{x}_i$ and the subspace
In reference to the picture above: by the Pythagorean theorem, $\|\vec{x}_i\|^2$ splits into the squared norm of the projection plus the squared distance to the subspace, so minimizing the average squared distance is equivalent to maximizing the average squared norm of the projection.
Data is centered if $\sum_{i=1}^n \vec{x}_i = \vec{0}$
If the data is not centered, you can subtract the average from each $\vec{x}_i$ to obtain centered data. Specifically, let
$$\bar{x} = \frac{1}{n} \sum_{i=1}^n \vec{x}_i$$
Then let $\vec{y}_i = \vec{x}_i - \bar{x}$. This new data $\vec{y}_i$ is now centered.
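As a minimal sketch of this centering step (in NumPy, with made-up data; the variable names are ours, not from the lecture):

```python
import numpy as np

# Hypothetical data: n = 5 points in R^3, one point per row.
rng = np.random.default_rng(0)
X = rng.normal(loc=2.0, size=(5, 3))

x_bar = X.mean(axis=0)   # the average x-bar of the x_i
Y = X - x_bar            # y_i = x_i - x_bar, broadcast across rows

# The centered data now sums to (numerically) zero.
assert np.allclose(Y.sum(axis=0), 0.0)
```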
How Do We Find the Subspace?
We really want to find an orthonormal basis for the subspace $U$. Namely, $\vec{v}_1, \dots, \vec{v}_k$ that maximize the average squared norm of the projections of the data onto $U$, or:
$$\frac{1}{n} \sum_{i=1}^n \sum_{j=1}^k \langle \vec{x}_i, \vec{v}_j \rangle^2$$
is the value we want to maximize.
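Written as code, this objective might look like the following sketch (the helper name and array shapes are our assumptions, not from the lecture):

```python
import numpy as np

def average_squared_projection(X, V):
    """(1/n) * sum_i sum_j <x_i, v_j>^2 for centered data.

    X : (n, d) array, one data point x_i per row.
    V : (d, k) array whose columns v_1, ..., v_k are orthonormal.
    """
    coords = X @ V              # entry (i, j) is <x_i, v_j>
    return (coords ** 2).sum() / X.shape[0]
```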
Define the matrix $X \in \mathbb{R}^{n \times d}$ such that the rows of the matrix are the data:
$$X = \begin{bmatrix} \text{---} & \vec{x}_1 & \text{---} \\ \text{---} & \vec{x}_2 & \text{---} \\ \text{---} & \vec{x}_3 & \text{---} \\ & \vdots & \end{bmatrix}$$
Given any unit vector $\vec{u}$, then:
$$X\vec{u} = \begin{bmatrix} \langle \vec{x}_1, \vec{u} \rangle \\ \langle \vec{x}_2, \vec{u} \rangle \\ \vdots \\ \langle \vec{x}_n, \vec{u} \rangle \end{bmatrix}, \qquad \text{so} \quad \sum_{i=1}^n \langle \vec{x}_i, \vec{u} \rangle^2 = \|X\vec{u}\|^2$$
We want to maximize $\|X\vec{u}\|^2$. Notice:
$$\|X\vec{u}\|^2 = \langle X\vec{u}, X\vec{u} \rangle = (X\vec{u})^T X\vec{u} = \vec{u}^T \underbrace{X^T X}_{\text{symmetric; set to } A} \vec{u}$$
Thus we want to diagonalize $A$: $A = QDQ^T$.
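A direct way to carry this out is an eigendecomposition of $A$; here is a minimal NumPy sketch, assuming the data is already centered (the data itself is made up):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)                  # center the data first

A = X.T @ X                             # symmetric, so eigh applies
eigvals, Q = np.linalg.eigh(A)          # eigh returns ascending eigenvalues
eigvals, Q = eigvals[::-1], Q[:, ::-1]  # reorder: largest eigenvalue first

# The first k columns of Q span the best k-dimensional subspace.
```

In practice, explicitly forming $X^T X$ squares the condition number of the problem, which is one reason the SVD route below is preferred numerically.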
We can use the SVD to write $X = USV^T$:
$$A = X^T X = (USV^T)^T USV^T = (VS^T U^T)(USV^T) = VS^T SV^T$$
using $U^T U = I$. Recall that $S^T S$ is a diagonal matrix; its entries are the squared singular values of $X$, and these are the eigenvalues of $A$. So the right singular vectors of $X$ (the columns of $V$) are the eigenvectors of $X^T X$.
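A quick NumPy sketch of this relationship (again with made-up data):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)

# Thin SVD: X = U S V^T. NumPy returns V^T directly, with the singular
# values s in descending order.
U, s, Vt = np.linalg.svd(X, full_matrices=False)

# The squared singular values of X match the eigenvalues of A = X^T X.
eigvals = np.linalg.eigvalsh(X.T @ X)
assert np.allclose(np.sort(s**2), np.sort(eigvals))
```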
Recall that $X = USV^T$. Then, since $V$ is an orthogonal matrix:
$$XV = US = \begin{bmatrix} \langle \vec{x}_1, \vec{v}_1 \rangle & \langle \vec{x}_1, \vec{v}_2 \rangle & \dots & \langle \vec{x}_1, \vec{v}_d \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle \vec{x}_n, \vec{v}_1 \rangle & \langle \vec{x}_n, \vec{v}_2 \rangle & \dots & \langle \vec{x}_n, \vec{v}_d \rangle \end{bmatrix}$$
What we can do is take any row and keep only its first $k$ entries to obtain the coordinates of one $\vec{x}_i$ in the subspace we found. The projection itself is
$$\operatorname{proj}_U(\vec{x}_i) = \sum_{j=1}^k \langle \vec{x}_i, \vec{v}_j \rangle \, \vec{v}_j$$
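Putting the whole pipeline together, here is a hedged end-to-end sketch (function and variable names are ours, not from the lecture):

```python
import numpy as np

def pca_project(X, k):
    """Project the centered rows of X onto the span of the top-k
    right singular vectors v_1, ..., v_k.

    Returns the k-dimensional coordinates and the projections in R^d.
    """
    _, _, Vt = np.linalg.svd(X, full_matrices=False)
    V_k = Vt[:k].T             # d x k: v_1, ..., v_k as columns
    coords = X @ V_k           # row i holds <x_i, v_1>, ..., <x_i, v_k>
    X_proj = coords @ V_k.T    # proj_U(x_i) = sum_j <x_i, v_j> v_j
    return coords, X_proj

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))
X = X - X.mean(axis=0)         # remember: the data must be centered
coords, X_proj = pca_project(X, k=2)
```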