Hey there! It’s been a while since I wrote anything other than stuff about travel (oh, don’t you worry, there’s still more of that coming!), so it feels good to write about something like this.
Right now, I’m almost finished with the Andrew Ng Machine Learning course on Coursera. Maybe I’ll write about it sometime, but it’s really, really solid and I’m learning a lot. He’s pretty great at explaining concepts and the course is constructed pretty well. What I really like is that, for the assignments, he’ll take the concept from that week and demonstrate a really interesting application of it (even if it’s a little contrived and may not actually be a practical use for it). Either way, it just gets me to think about the breadth of what this stuff can be applied to.
So, this week was on Principle Component Analysis (PCA), which is what this article is about. A large part of the beginning will be reiteration of the concept and what the assignment had us do, but then at the end I experiment on my own a little, so skip along to there if you already know this stuff.
PCA sounds useful, but also kind of boring/obvious at first. Let me briefly explain. Basically, imagine you have two dimensional data, with the dimensions x and y. You plot it, and it looks like this (the blue):
Obviously, there’s a lot of correlation between the data in the x and y dimensions. They pretty much follow that red line I also plotted. So, if you don’t mind losing a bit of accuracy, you can “project” the blue data onto the red line, and now (because a line is 1D), to specify the data point, you can just say the distance along the line it is (from some point). So you’ve effectively turned a 2D dataset into a 1D dataset, which is way easier to model.
So at this point, it seemed kind of boring to me. Definitely useful (he mentions that if you’re using Neural Nets, then it’s really good if you can cut the dimensions down), but just kind of boring. Then, it gets really interesting. He talks about taking something with maybe, 1000 dimensions, and using PCA to cut it down to 100, since in real life there’s a good chance some of those features are really correlated (or even just, the same feature with different units, if you work in industry where the data you were given might just be some big spreadsheet where they happened to have the same measurement with different units or something). This made my ears perk up: okay, I get the principle, but how, in practice, would I figure out which of those 1000 features were correlated?
It turns out you use the Covariance Matrix. There are a lot of pretty useful and interesting links that give good intuitive explanations for how the CM gives you the correlation between variables, but I’ll let them explain it better than I ever could. The practical aspect is that, if you have N features, you end up with an N x N matrix. From this, you can get the eigenvalues of the covariance matrix. These are now the new “basis vectors” for your data, and also an N x N matrix (a good way to look at it is, N vectors that each have N entries). If before your (2D, for example) data had the features x and y, then a unit of x or a unit of y were your basis vectors. Any data point can be built out of some amount of x plus some amount of y. But, you can use almost anything (as long as they aren’t orthogonal) as basis vectors, and PCA helps you find the ones that make the most sense. Taking this image from the Andrew Ng PCA assignment (the 2D data he used is more messy than mine above):
Those black lines (the principle components, PCs, or eigenvectors) obviously make more sense as basis vectors for that data than the original features (though obviously, you eventually want answers in terms of your original features). For example, you can tell that, though there’s variance in the direction of the shorter black line, there’s much less so than in the other direction, so if you were going to drop one dimension, it makes more sense to drop that one, since you’ll retain more of the original info.
So that’s already cool to me, because it shows a really easy, automated way to find the correlation in datasets with large numbers of features, which I didn’t know about. But wait! It gets cooler.
Next, he applies it to faces, which I thought was really cool. This part of the assignment was at the end, where it’s not graded and you don’t really have to do any work. The code is completely written, so you just kind of execute it and try to follow along with the idea. The idea is that, for a small (32px x 32px) (black and white) picture of a face, that’s really just equivalent to a 32*32 = 1024 vector of values for the darkness of that pixel. So, in this formulation, the basis vectors are the values of each individual pixel, and there are 1024 pixels, so you need 1024 dimensions.
What if these dimensions are correlated, though? Looking at a few of the faces in the training set he gave, you can quickly see things like symmetries and groupings:
For example, most images have little patches in the corners that are all pretty similar colors. Likewise, most people eyes in roughly the same places in the images. If you can represent a bunch of pixels with the combination of just a couple properties, well, now you’re describing the same thing with much less information.
So, if you have a huge dataset of these faces, you can calculate the covariance matrix for these 1024 pixel arrays (that represent faces) and their PCs. Originally, the basis vectors for these images were the 1024 long arrays (1,0,0,…,0), (0,1,0,…,0), etc (one for each pixel). So what are the PCs exactly?
Well, they’re still constant 1024 long arrays, but the different is that they don’t correspond to one pixel, they correspond to a whole image. He mapped them as well:
HOW COOL IS THAT. Those are basis vectors (the most important 36, there are 1024 total). They look pretty creepy, since they’re basically “general face features”. The first one is the general shape that most faces have. Pretty much all faces are going to need some amount of that pattern. You can see that they get more detailed as you go to less important ones, and they also tend to focus on specific things, like eyebrown or lip shapes.
So, now, the general idea is this. Any given face will be some of PC + some of PC + … + some of PC (where the “some of” weight can also be negative!). But, recall the example above. We plotted the PCs for the 2D example, and it turned out one was much more important than the other. So if you dropped the less important one, you still had most of the information. We can do the same here. We can keep some number, say 100, of the most important faces, and reconstruct the face from only those. Now, because we’ve chosen these ones to be the most important ones of the principle components, we can expect much less degradation of the image from this, compared to removing the same amount of basis vectors (i.e., pixels) in the original formulation. He does exactly that here, keeping only the first 100 PCs:
You can see that, at that small size, dropping 90% of the data in this way leaves you with a pretty similar looking image. Maybe some loss, but not too bad! How cool.
So, this is where the assignment ended with this. But I wanted to both implement it in python (they use matlab/octave…) and also mess around a little on my own and have some fun.
The first thing I did was to implement it in python, and then have it calculate the PCs from the same faces as the assignment. Then, I made it import a square face you give it (experimenting with my own!), black and white it, and resize it to 32×32.
Lemme tell ya, it’s weird looking at your own face so much while doing a thing like this.
Anyway, then, I use the PCA from the assignment faces to find out the PCs of my own face. Something I was curious about was, exactly how many PCs do you need to “recover” your face decently? He showed that they looked pretty good at 100, but those were displayed tiny, and I wanted to see how they improved with different numbers. Here we go:
You can see a few things. One is that the first entry, N = 0 (so, just the first PC) is a really generic, blurry face shape. That one will be there for any face we plug in (though it could be darker or lighter depending on the weight for a given face). Then, it starts adding other ones, and it starts getting more detailed (again, just to be clear: the N value here is the number of PCs I’ve added up to get the face, it’s not just the Nth PC). It honestly looks pretty bad until N = 250 or so, which is strange because his examples looked pretty good with only 100. I thought it was maybe because I’ve made them larger here, but even if I shrink the image, my one at N = 129 still looks worse. Weird.
I also did it as a gif, which is kinda cool because you can see it change continuously:
However, I’ll mostly stick with the still image progressions because they’re a little easier to compare.
So, one thing I was immediately wondering about is the order of the PCs. That is, the eigenvector matrix of the PCs are in some order of those PCs. To be more explicit, once you have the that matrix, you do the matrix multiplication of your face array (which is 1×1024) by the matrix, which gives you another 1024 long vector of the weights of each of those PCs. So, above, when I was only taking some subset (like, the first 100 PCs), I just took the first 100 in the order they were already in the eigenvector matrix.
However, if you think about it, that’s not necessarily the best order. Even if they were automatically in the best order (i.e., the PCs that mattered the most towards the front) from the calculation of that matrix, that’s the best order for the “training set” of the faces you found them from. As you’ll see in a second, I think it actually does automatically order them that way, but still, your face doesn’t necessarily need the same PCs in the same ranked importance as they’re given to you. And, if it doesn’t (that is, other, automatically “lower ranked” ones are more important), then the progression as you add more PCs for the default/automatic/unsorted one should look different (read: worse) than the progression for if you rank them based on the set of weights for your specific face.
So, I did the progression for automatic vs sorted. Here are the unsorted vs sorted ones:
It’s not a ton better, but it’s definitely there. You can see that by about N = 300, they’re pretty much the same. That tells you a couple things. One is that the eigenvectors are already sorted by the weights (from the set of faces that it was calculated from). Another thing is that, for the most part, the weight/PC ordering of your face probably matches a general face pretty well. That is, your weights themselves might be different (you have a different face!), but the order of them (by magnitude) are probably real similar.
We can see this a little more quantitatively by plotting the (magnitude) of the unsorted and sorted weights (for my face), just by their indices. The sorted one is necessarily always decreasing, but the unsorted one is a little “out of order”, in the sense that some PCs have larger weights than ones with smaller indices. Still, it’s like…90% the same.
I’ve plotted it from only 0-200 PCs, because they look really similar after that.
Okay, here’s one more funny little thing. When I was coding this, I had a small bug in my code where when I was importing my original face image, it was accidentally turning it sideways. So, when I would calculate the PCs for it, and do the progression thing… it had to form a sideways face out of vertical face components! Of course, given enough of them, it still can, right? It’s just another basis, but whereas the “face basis” for normal faces is the perfect set of basis vectors, now it’s probably not that great. So, the progression looked like this:
It’s really weird. You can see the same 0th PC in the N = 0 spot, but then up to about N = 129, it still looks kind of like a vertical face; i.e., you can see the components still. Then, it appears to use what normally might be PCs for forming eyebrows or necks to form my (horizontal) hair! So cool. Interestingly, even with this non optimal basis, it’s still looking pretty good after about N = 300 PCs.
Anyway, lastly I thought it would be fun to do it to my friends’ faces, and see how different our PC magnitude curves are.
I thought that maybe, if I looked at the (original/unsorted by magnitude, so they’re in the same order and thus comparable) PC weights, I could notice differences between peoples’ PC weights that would be visible in the images. Here’s the plot:
Again, I’m only looking at the first 30 PC weights because they get pretty small and all about the same after that, which you can see really starting at about 20. Another thing to point out is that here I’m actually plotting the values of the weights, not their magnitudes, so you’ll notice that there are negative values. If a PC here represents a constant pixel map, and its weight is how much you “scale” that map (multiplying each pixel value), then a negative is simply subtracting the pixel value. You can definitely imagine why this would be necessary for faces. So, a PC pixel map can end up being used for both how it looks and its inverse.
So, the weights are all pretty similar. The couple outliers that stuck out to me were at N = 2, where Phil has a significant large negative value when everyone else is pretty close to 0, and at N = 7 when Will has a large negative value, followed a bit by Bobby and Phil, and everyone else is about 0. This gets pretty speculative, but here’s a guess. Looking at the PC for N = 2:
and Phil’s face:
The PC is clearly used for brightness asymmetry, and Phil’s face has that too since his face was lit up a little asymmetrically (maybe. I said it was a guess!).
Looking at N = 7:
Now, this is definitely motivated reasoning, but this one might kind of represent “beardiness”, since Will has a pretty thick beard in this case. Looking at the other two people with N = 7 values away from 0, Bobby and Phil:
They do both have darker bottom-half faces than anyone else. Orrrrrrrrrr maybe I’m totally wrong.
Well, we’ve had a dangerous amount of fun here today, so maybe it’s time to say goodbye. One last thing I’m interested in, that maybe I’ll check soon, is: we did a similar thing for the neural net assignment, having it learn faces. It turned out that with the NN we used, with one hidden layer of 25 units, each hidden unit layer represented a pretty distinct feature/pattern of a face, or at least an area. What I’m wondering after this is, how much overlap there is between the PCs and the features learned by a NN on the same set of faces. Would the hidden layers learn almost exactly the top PCs?