Lecture 20 - Principal Component Analysis (Application)

The goal for this application is:

Find the $k$-dimensional subspace that minimizes the average squared distance between the points $x_i$ and the subspace.

Equivalently, since each $x_i$ splits (by the Pythagorean theorem) into its projection onto the subspace plus an orthogonal component, we can instead maximize the average squared norm of the projections onto the subspace.

Centered Data

Data is centered if $\sum_{i=1}^{n} x_i = 0$.

If the data is not centered, you can simply subtract the average from each $x_i$ to obtain centered data. Specifically, let

$$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_i$$

Then let $y_i = x_i - \bar{x}$. This new data $y_i$ is now centered.
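
A minimal numpy sketch of this centering step, using a small hypothetical data matrix for illustration:

```python
import numpy as np

# A small hypothetical data matrix; rows are the points x_i.
X = np.array([[1.0, 2.0],
              [3.0, 0.0],
              [5.0, 4.0]])

x_bar = X.mean(axis=0)                # the average of the x_i
Y = X - x_bar                         # rows y_i = x_i - x_bar
assert np.allclose(Y.sum(axis=0), 0)  # the y_i now sum to zero
```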

How do we Find the Subspace?

We really want to find an orthonormal basis $v_1, \dots, v_k$ for the subspace $U \subseteq \mathbb{R}^d$ that maximizes the average squared norm of the projections of the points onto $U$. That is,

$$\frac{1}{n} \sum_{i=1}^{n} \sum_{j=1}^{k} \langle x_i, v_j \rangle^2$$

is the value we want to maximize.
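
To make the objective concrete, here is a minimal numpy sketch that evaluates it for one candidate basis; the random data and the choice of the first $k$ standard basis vectors are assumptions for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))    # hypothetical data; rows are the x_i
X = X - X.mean(axis=0)           # center it, as above
n, d = X.shape

k = 2
V = np.eye(d)[:, :k]             # a hypothetical orthonormal basis v_1, ..., v_k
                                 # (here: the first k standard basis vectors)

# (1/n) * sum_i sum_j <x_i, v_j>^2, the quantity to maximize over bases
objective = np.sum((X @ V) ** 2) / n
print(objective)
```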

Define the matrix $X$ whose rows are the data points:

$$X = \begin{bmatrix} x_1^T \\ x_2^T \\ \vdots \\ x_n^T \end{bmatrix}$$

Given any unit vector $u$, we have:

$$Xu = \begin{bmatrix} \langle x_1, u \rangle \\ \langle x_2, u \rangle \\ \vdots \\ \langle x_n, u \rangle \end{bmatrix}, \qquad \sum_{i=1}^{n} \langle x_i, u \rangle^2 = \|Xu\|^2$$

We want to maximize $\|Xu\|^2$. Notice:

$$\|Xu\|^2 = \langle Xu, Xu \rangle = (Xu)^T X u = u^T \underbrace{X^T X}_{\text{symmetric, set to } A} u$$
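
As a sanity check, the identity $\|Xu\|^2 = u^T A u$ can be verified numerically; this is a minimal sketch with randomly generated data (an assumption for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))    # hypothetical data matrix
u = rng.normal(size=5)
u /= np.linalg.norm(u)           # normalize u to a unit vector

A = X.T @ X                      # the symmetric matrix A
lhs = np.linalg.norm(X @ u) ** 2
rhs = u @ A @ u
assert np.isclose(lhs, rhs)      # ||Xu||^2 == u^T A u
```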

Thus we want to diagonalize $A$. Since $A$ is symmetric, the spectral theorem guarantees a diagonalization $A = QDQ^T$ with $Q$ orthogonal and $D$ diagonal.

We can use the SVD to write $X = USV^T$:

$$A = X^T X = (USV^T)^T (USV^T) = (V S^T U^T)(U S V^T) = V S^T S V^T,$$

since $U^T U = I$.

Recall that $S^T S$ is a diagonal matrix; its entries are the squared singular values of $X$. So $A = V (S^T S) V^T$ is precisely a diagonalization of $A$: the eigenvalues of $X^T X$ are the squared singular values of $X$, and the right singular vectors of $X$ (the columns of $V$) are the eigenvectors of $X^T X$.
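
A minimal numpy sketch of this relationship, again assuming random data for illustration (note that `np.linalg.svd` returns $V^T$ directly):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))    # hypothetical data matrix

# numpy returns X = U @ np.diag(S) @ Vt, with S sorted in descending order
U, S, Vt = np.linalg.svd(X, full_matrices=False)

A = X.T @ X
# The eigenvalues of A are the squared singular values of X ...
assert np.allclose(np.sort(S ** 2), np.linalg.eigvalsh(A))

# ... and each right singular vector (column of V, row of Vt) is an
# eigenvector of A.
v1 = Vt[0]
assert np.allclose(A @ v1, (S[0] ** 2) * v1)
```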

Recall that $X = USV^T$. Then, since $V$ is an orthogonal matrix ($V^T V = I$):

$$XV = \begin{bmatrix} \langle x_1, v_1 \rangle & \langle x_1, v_2 \rangle & \cdots & \langle x_1, v_d \rangle \\ \vdots & \vdots & \ddots & \vdots \\ \langle x_n, v_1 \rangle & \langle x_n, v_2 \rangle & \cdots & \langle x_n, v_d \rangle \end{bmatrix} = US$$

To obtain the projection of a single $x_i$ onto the $k$-dimensional subspace we found, take the $i$-th row of $XV$ and keep only its first $k$ entries, $\langle x_i, v_1 \rangle, \dots, \langle x_i, v_k \rangle$. These are the coefficients of the projection:

$$\hat{x}_i = \sum_{j=1}^{k} \langle x_i, v_j \rangle \, v_j$$
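
Putting the steps together, here is a minimal end-to-end PCA sketch in numpy, assuming random data for illustration: center the data, take the SVD, keep the top $k$ right singular vectors, and form the projections $\hat{x}_i$:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))    # hypothetical data; rows are the x_i
X = X - X.mean(axis=0)           # center the data first

k = 2
# S is sorted in descending order, so Vt[:k] holds the top-k directions
U, S, Vt = np.linalg.svd(X, full_matrices=False)
V_k = Vt[:k].T                   # top-k right singular vectors as columns

scores = X @ V_k                 # row i holds <x_i, v_1>, ..., <x_i, v_k>
X_hat = scores @ V_k.T           # row i is sum_j <x_i, v_j> v_j
```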