Monday, January 12, 2009

PCA for Images using Python

I've looked around for a nice and clean implementation of principal component analysis in python but couldn't really find anything that fits. PCA Module looks useful but seems like a lot of stuff for something this simple. So anyway, I decided to spend part of my Monday evening writing a simple implementation. Here it is.

Principal component analysis (PCA) is a useful technique for dimensionality reduction and is optimal in the sense that it represents the variability if the training data with as few dimensions as possible. The resulting projection matrix can be seen as a change of coordinates to a coordinate system where the coordinates are in descending order of importance.

To do PCA on image data, the images need to be converted to a one-dimensional vector representation, e.g. using flatten() in numpy. A tiny 100*100 pixel grayscale image is then a point in a 10,000 dimensional space, a megapixel image has dimensions in the millions, so dimensionality reduction comes handy in many applications.

The flattened images are collected in a single matrix by stacking them, one row for each image. The rows are then centered around the mean image before the computation of the dominant directions. To find the principal components, singular value decomposition (SVD) is usually used but if the dimensionality is high, there is a useful trick that can be used instead since the
SVD computation will be very slow in that case. Here is what it looks like in python code.

from PIL import Image
from numpy import *

def pca(X):
# Principal Component Analysis
# input: X, matrix with training data as flattened arrays in rows
# return: projection matrix (with important dimensions first),
# variance and mean

#get dimensions
num_data,dim = X.shape

#center data
mean_X = X.mean(axis=0)
for i in range(num_data):
X[i] -= mean_X

if dim>100:
print 'PCA - compact trick used'
M = dot(X,X.T) #covariance matrix
e,EV = linalg.eigh(M) #eigenvalues and eigenvectors
tmp = dot(X.T,EV).T #this is the compact trick
V = tmp[::-1] #reverse since last eigenvectors are the ones we want
S = sqrt(e)[::-1] #reverse since eigenvalues are in increasing order
print 'PCA - SVD used'
U,S,V = linalg.svd(X)
V = V[:num_data] #only makes sense to return the first num_data

#return the projection matrix, the variance and the mean
return V,S,mean_X

The threshold for when to switch from SVD to use the trick with computing eigenvectors of the (smaller) covariance matrix X*X^T is a bit arbitrary in this case but that doesn't really matter. The rows of the matrix V are orthogonal and contains the coordinate directions in order of descending variance of the training data.

Let's try this on an example of face images downloaded from the "Labeled Faces in the Wild" data set available at

The file contains 20 selected images of George W Bush, cropped slightly tighter to the face than the originals. Assuming that the filenames of these cropped face images are stored in a list, imlist, the principal components can be computed and shown like this.

from PIL import Image
import numpy
import pylab

im = numpy.array([0])) #open one image to get the size
m,n = im.shape[0:2] #get the size of the images
imnbr = len(imlist) #get the number of images

#create matrix to store all flattened images
immatrix = numpy.array([numpy.array([i])).flatten() for i in range(imnbr)],'f')

#perform PCA
V,S,immean = pca(immatrix)

#mean image and first mode of variation
immean = immean.reshape(m,n)
mode = V[0].reshape(m,n)

#show the images


Note that the images need to be converted back from the one-dimensional representation using reshape(). Running the example should give two plots like these.

The mean image of the 20 sample images and the first mode, i.e. the direction with most variation.

This example is of course a bit silly since, to have something useful, the images should be properly aligned and cropped or masked to remove the influence of unimportant pixels.

Anyway, hope this is useful to someone. I'll write an update with some better examples (I have some ideas for cool data sets) when there is time.


  1. Nice example, got it working with macbook. So could this be used to determine, if one of the two given pictures presents (approximation) Bush? How would you do comparison with Python?

  2. Hi Marko

    Yes, using PCA like this can be used for recognition. You could then take as many modes as you like and use those as a basis for representing (encoding) face images. Then each face is represented by coordinates of the projection on each of these basis images. This is one of the early methods for face recognition called "eigenfaces".

    Comparing faces is then a matter of computing distance between such vectors. The simplest is standard Euclidean distance d = sum((u-v)**2) but there are lots of options here...

  3. Since im taking my first steps on learning this recognition stuff, could you please explain, what is the difference between these two comparing methods:

    1) d = sum((u-v)**2)
    2) d = sum(abs(u-v))

    I have used latter to get difference (0-255) between pixels, when comparing two pictures. Then if i want to get random image placeholder (target) closer to input image, i'll do: output = input + ((target - input)/2) where number 2 is varied by how quick i want target convert toward input.

    Btw. sometimes i dont need at all on my scripts...

  4. @Marko: When it comes to measuring distance between two images/vectors/features etc there are many choices. Two of the most common are the ones you mention (with the exception that you need to take the square root if you want it by the book).

    1) is called L2 distance or Euclidean distance and that's the "normal" distance you should be familiar with from measuring distances in real life. Check e.g.

    2) is called L1 distance or Manhattan distance. Check e.g.

    There are many many other choices and versions of these. Which one to use depends on the problem/application and you usually have to try a few options before you find the best one for each problem. There are a lot of other fundamental differences like 1) is more sensitive to "large errors" but easier to use with optimization techniques, 2) is more robust... but that will turn into a very long answer.

    Perhaps a topic for the future.

  5. Awesome :) Thanks for this easy tutorial!

  6. Hi,
    shouldn't you compute M=dot(X.T,X), as this would correspond to the version where each image corresponds to a point in the eg. 10000 dimensional "pixelspace", whereas the way you do it you have the confusionmatrix for 10000 in a 20 (for 20images) dimensional image space. The latter doesn't make much sense.

    What you usually want to find is the direction in the pixelspace that has maximal variation.

  7. ok I take that back, that was the "trick" you described.

  8. Hello,

    Thanks for sharing your implementation. It helped me a lot to cook up my own.

    Like the above commenter, at first it confused me that your covariance matrix is 20 x 20 rather than 8400 x 8400. In Wikipedia and in the PCA tutorial by Jonathon Shlens (excellent, BTW), X is a M x N matrix where M is the number of dimensions/variables and N is the number of samples/observations. I was expected the pixels to be the dimensions and the images to be the number of samples but in your example this is reversed.

    I work mostly on machine learning applications to text so I may be wrong but I reached the conclusion that by doing it the way you did, we are trying to find the image (V[0] in your code) in a new basis such that it has greatest variance. This image can be used as a prototype/template.

    Here are a few enhancements I made and remarks:

    - You can center the data with a one liner like this:

    X = X - X.mean(axis=1).reshape(-1,1)

    - The documentation of eigh says that the order of the eigenvalues is not guaranteed so I used argsort.

    - Rather than doing V = V[:num_data], I used full_matrices=False as an argument to svd(). This also speeds up things quite a bit.

    - v in your code is the standard deviation.

    - Strictly speaking, dot(X,X.T) should be divided by N to get the covariance.

  9. Thanks for a nice snippet of code!

    1. Could I use and distribute the derived pca function (with due reference) in my own set of python scripts?

    2. Where can I find the citation for the "useful trick"?

  10. Can I ask you some questions about PCA and your python code?

  11. The V array returned from the trick isn't the same as the V array returned from the SVD. They differ in scaling by the magnitude of the eigenvector.

  12. Thanks so much for sharing this idea. Can you throw more light on what the input matrix looks like. Probably an example would help. I am struggling to get the right format for the "flattenend matrix". Thanks

  13. @Anonymous: Here's an example, perhaps more clear now.

    from numpy import *

    def create_image():
    # silly function to create 'images'
    return random.rand(100,100)

    # create random image
    im = create_image()
    print im.shape # should be (100,100)

    # flatten
    im = im.flatten()
    print im.shape # should now be (10000,)

    # create matrix of 20 flattened 'images'
    immatrix = array([create_image().flatten() for i in range(20)])
    print immatrix.shape # should be (20, 10000)
    # so 20 rows of 10000 pixels!

  14. Thanks Erik, much clearer now. Please, do you have any idea how to implement Canny edge detector in python. I tried implementing but the result seem messy. Especially at the "non-maximum suppression level".

  15. Hello, I'm a beginner in the field of pattern recognition and machine learning. Would like to use Support vector machine to classify a set of data matrix. Any help on this?

  16. @Anonymous: Regarding SVM examples. I have a section in my upcoming book that might be of help. Take a look at section 8.3 and let me know if this is a useful introduction for you:

    1. Thanks Jan, you have done a great job. It is a very useful introduction for me. When will the update be available. I didn't quite get the notion of using pickle. I tried it but couldn't get the content of the file. Any help on this.

    2. @Anonymous Pickle is just used to read a set of random 2D points (created at the beginning of the chapter). I wanted to save them to a file so that all experiments were done on the exact same points. You can just run the script under the section "A simple 2D example" to create your own.

    3. What does array represent in the code...?
      "class_2 = 1.2 * randn(n,2) + array([5,1])"

    4. Please it's not clear what some names/variables represent i.e hstack...??? in the code in section (A simple 2D example). Better explanation is appreciated.

    5. @Anonymous: array is just the numpy type, hstack the horizontal stacking function in numpy. See the original code at the top. In the small example in the comments I imported numpy with "from numpy import *" so the namespace is lost.

  17. Hi Erik, thanks for the useful post. How can one read in a sequence of images from a folder (dir) to form an array as an input for the PCA training set. Any help on this is appreciated.

  18. @Anonymous Try something like this:
    data = array([ array( for i in imnames])

  19. What must i do with the input image for compare it with the output V vector. I want compare the output and input with Euclidean distance d = sum((u-v)**2). what represent u?

    PD: Sorry for my English is very poor

  20. In your book you mention that there are ways to quickly work out a "few" eigenvectors. Are you referring to the Nipals algorithm or do you have any tips/links on alternative superfast PCA algorithms (in special for huge datasets and where only few eigenvectors are needed).

  21. @Guapo: Two common methods are the Arnoldi and Lanczos methods. There are efficient implementations of these in ARPACK which is used by scipy.sparse.linalg.eig(). I haven't used it but that would be my first thing to try.

    Documentation is here

  22. With an input array of shape (5,4), why does linalg.svd return a (4,4) eigenvector array while returns one that's (5,4)? Same with the less is produced via svd.

    The approach gives the correct number of eigenvectors for calculating principal components using all of the inputs. Since svd gives one fewer, why would it ever work correctly for this?

    1. @Anonymous See definition of SVD e.g. at

      For a (5,4) matrix, U is (5,5) and V is (4,4). That's just the way the factorization works. In these problems, the feature vectors are rows in the matrix and therefore the basis vectors should have the same dimension (in your case 4).

      SVD and dot are different operations.

  23. Hi
    Thanks for this explanation , but I have some questions related to the implementation of this algorithm , I am try to develop PCA face recognition in JAVA, and I have these questions
    First : when calculating the average face (meanface)should I talk into consideration that I am working on images , so the average of all images vector can’t be calculated in the same way as in algebra , but we should sum the same components for each pixel (ex: red with red , green with green ,…)with each other , then divide the result by the number of image’s vectors.
    It the previous method is true, then what about multiplication of vectors! How should I calculate it when I am dealing with vectors of images?
    Second : let’s suppose that the previous subjection is fault, when I calculated the feature vectors for the images (by using the algebra method as explained in the algorithm )the resulting vectors contains pixels with very huge numbers , that can’t represent any image when trying to retrieve the source images by use of feature vectors and Transformation matrix , so What is my problem!?

  24. Thanks very much, this will be useful to me as I seek to understand PCA. It is quite difficult to find an implementation in Python that is built with few components and is well-commented and explained. I look forward to trying this out.

  25. This comment has been removed by the author.

  26. I am not getting the subplots in a single screen. I get 2 figures in one window, once i close the preceding window i get rest of the character images one in each window. why is this happening?
    I am a starter in Python, this book seemed interesting so working on it. Can i get help on this? Thanks in advance.

  27. I am getting an error:
    IOError: [Errno 21] Is a directory: '/home/aquib/Desktop/PCA'