Thursday, April 30, 2009

Hierarchical Clustering in Python

Continuing on the topic of clustering algorithms, here is another popular choice with a Python implementation. No need to copy paste; all the code can be found here.

Hierarchical clustering is another simple but powerful clustering algorithm. The idea is to build a similarity tree based on pairwise distances. The algorithm starts by grouping the two closest objects (based on the distance between feature vectors) and creates an "average" node in a tree with the two objects as children. Then the next closest pair is found among the remaining objects, now also including any average nodes, and so on. At each node the distance between the two children is also stored. Clusters can then be extracted by traversing this tree and stopping at nodes with a distance smaller than some threshold.

Hierarchical clustering has several benefits. For example the tree structure can be used to visualize relationships and how clusters are related. A good feature vector will give a nice separation in the tree. Another benefit is that the tree can be used with different cluster thresholds without recomputing the tree. The drawback, however, is that one needs to choose a threshold if the actual clusters are needed.

Let's see what this looks like in code. Create a file hcluster.py and add the following code (partially taken and modified from the example in "Programming Collective Intelligence" by Toby Segaran, O'Reilly Media 2007, page 33).

from numpy import *

class cluster_node:
    def __init__(self,vec,left=None,right=None,distance=0.0,id=None,count=1):
        self.left=left
        self.right=right
        self.vec=vec
        self.id=id
        self.distance=distance
        self.count=count # only used for weighted average

def L2dist(v1,v2):
    return sqrt(sum((v1-v2)**2))

def hcluster(features,distance=L2dist):
    # cluster the rows of the "features" matrix
    distances={}
    currentclustid=-1

    # clusters are initially just the individual rows
    clust=[cluster_node(array(features[i]),id=i) for i in range(len(features))]

    while len(clust)>1:
        lowestpair=(0,1)
        closest=distance(clust[0].vec,clust[1].vec)

        # loop through every pair looking for the smallest distance
        for i in range(len(clust)):
            for j in range(i+1,len(clust)):
                # distances is the cache of distance calculations
                if (clust[i].id,clust[j].id) not in distances:
                    distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)

                d=distances[(clust[i].id,clust[j].id)]

                if d < closest:
                    closest=d
                    lowestpair=(i,j)

        # calculate the average of the two clusters
        mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0
                  for i in range(len(clust[0].vec))]

        # create the new cluster
        newcluster=cluster_node(array(mergevec),left=clust[lowestpair[0]],
                                right=clust[lowestpair[1]],
                                distance=closest,id=currentclustid)

        # cluster ids that weren't in the original set are negative
        currentclustid-=1
        del clust[lowestpair[1]]
        del clust[lowestpair[0]]
        clust.append(newcluster)

    return clust[0]

Running hcluster() on a matrix with feature vectors as rows will create and return the cluster tree. The choice of distance measure depends on the actual feature vectors; here we used the Euclidean distance, but you can create any function and pass it as a parameter.
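For instance, a Manhattan (L1) distance can be swapped in just as easily. A minimal sketch (the name L1dist is my own choice, not from the post):

```python
import numpy as np

def L1dist(v1, v2):
    # Manhattan (city block) distance; an alternative to the L2 default
    return np.sum(np.abs(v1 - v2))

v1 = np.array([1.0, 2.0, 3.0])
v2 = np.array([2.0, 4.0, 6.0])
print(L1dist(v1, v2))  # 6.0
```

You would then cluster with hcluster(features, distance=L1dist), since any function taking two vectors and returning a scalar works as the distance parameter.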

To extract the clusters from the tree you need to traverse the tree from the top until a node with a distance value smaller than some threshold is found. This is most easily done recursively.

def extract_clusters(clust,dist):
    # extract list of sub-tree clusters from hcluster tree with distance < dist
    if clust.distance < dist:
        # we have found a cluster subtree
        return [clust]
    else:
        # check the right and left branches
        cl = []
        cr = []
        if clust.left!=None:
            cl = extract_clusters(clust.left,dist=dist)
        if clust.right!=None:
            cr = extract_clusters(clust.right,dist=dist)
        return cl+cr

This function will return a list of sub-trees containing the clusters. To get the leaf nodes that contain the object ids, traverse each sub-tree and return a list of leaves using:

def get_cluster_elements(clust):
    # return ids for elements in a cluster sub-tree
    if clust.id>=0:
        # a non-negative id means that this is a leaf
        return [clust.id]
    else:
        # check the right and left branches
        cl = []
        cr = []
        if clust.left!=None:
            cl = get_cluster_elements(clust.left)
        if clust.right!=None:
            cr = get_cluster_elements(clust.right)
        return cl+cr
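Putting the two traversal functions together, here is a small self-contained sketch on a hand-built tree (the minimal cluster_node stand-in mirrors the class above so the snippet runs on its own; in practice you would import hcluster):

```python
import numpy as np

# Minimal stand-ins for the definitions in hcluster.py above.
class cluster_node:
    def __init__(self, vec, left=None, right=None, distance=0.0, id=None):
        self.left, self.right = left, right
        self.vec, self.id, self.distance = vec, id, distance

def extract_clusters(clust, dist):
    # return sub-trees whose merge distance is below the threshold
    if clust.distance < dist:
        return [clust]
    return (extract_clusters(clust.left, dist) if clust.left else []) + \
           (extract_clusters(clust.right, dist) if clust.right else [])

def get_cluster_elements(clust):
    # collect leaf ids (non-negative) under a sub-tree
    if clust.id >= 0:
        return [clust.id]
    return (get_cluster_elements(clust.left) if clust.left else []) + \
           (get_cluster_elements(clust.right) if clust.right else [])

# A hand-built tree: leaves 0 and 1 merged at distance 0.5,
# then joined with leaf 2 at distance 4.0.
leaf0 = cluster_node(np.array([0.0]), id=0)
leaf1 = cluster_node(np.array([0.5]), id=1)
leaf2 = cluster_node(np.array([5.0]), id=2)
inner = cluster_node(np.array([0.25]), left=leaf0, right=leaf1,
                     distance=0.5, id=-1)
root = cluster_node(np.array([2.6]), left=inner, right=leaf2,
                    distance=4.0, id=-2)

clusters = extract_clusters(root, dist=1.0)
print([get_cluster_elements(c) for c in clusters])  # [[0, 1], [2]]
```

With the threshold at 1.0, the merge at distance 4.0 is cut, leaving two clusters: the pair {0, 1} and the singleton {2}.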

Let's try this on a simple example to see it all in action. The file sunsets.zip contains 100 images downloaded from Flickr using the tag "sunset" or "sunsets". For this example we will use the average color of each image as feature vector. This is crude and simple but good enough for illustrating what hierarchical clustering does. Try running the following code in a folder containing the sunset images.

import os
from PIL import Image
from numpy import *
import hcluster

# create a list of images
imlist = []
for filename in os.listdir('./'):
    if os.path.splitext(filename)[1] == '.jpg':
        imlist.append(filename)
n = len(imlist)

# extract feature vector for each image
features = zeros((n,3))
for i in range(n):
    im = array(Image.open(imlist[i]))
    R = mean(im[:,:,0].flatten())
    G = mean(im[:,:,1].flatten())
    B = mean(im[:,:,2].flatten())
    features[i] = array([R,G,B])

tree = hcluster.hcluster(features)

To visualize the cluster tree, one can draw a dendrogram. This often gives useful information on how good a given descriptor vector is and what is considered similar in a particular case. Add the following code (also adapted from Segaran's book).

from PIL import Image,ImageDraw

def getheight(clust):
    # height of a sub-tree (number of leaves), used to size the image
    if clust.left==None and clust.right==None:
        return 1
    return getheight(clust.left)+getheight(clust.right)

def getdepth(clust):
    # maximum cumulative distance from this node down to a leaf
    if clust.left==None and clust.right==None:
        return 0
    return max(getdepth(clust.left),getdepth(clust.right))+clust.distance

def drawdendrogram(clust,imlist,jpeg='clusters.jpg'):
    # height and width
    h=getheight(clust)*20
    w=1200
    depth=getdepth(clust)

    # width is fixed, so scale distances accordingly
    scaling=float(w-150)/depth

    # Create a new image with a white background
    img=Image.new('RGB',(w,h),(255,255,255))
    draw=ImageDraw.Draw(img)

    draw.line((0,h/2,10,h/2),fill=(255,0,0))

    # Draw the first node
    drawnode(draw,clust,10,(h/2),scaling,imlist,img)
    img.save(jpeg)

def drawnode(draw,clust,x,y,scaling,imlist,img):
    if clust.id<0:
        h1=getheight(clust.left)*20
        h2=getheight(clust.right)*20
        top=y-(h1+h2)/2
        bottom=y+(h1+h2)/2
        # Line length
        ll=clust.distance*scaling
        # Vertical line from this cluster to children
        draw.line((x,top+h1/2,x,bottom-h2/2),fill=(255,0,0))

        # Horizontal line to left item
        draw.line((x,top+h1/2,x+ll,top+h1/2),fill=(255,0,0))

        # Horizontal line to right item
        draw.line((x,bottom-h2/2,x+ll,bottom-h2/2),fill=(255,0,0))

        # Call the function to draw the left and right nodes
        drawnode(draw,clust.left,x+ll,top+h1/2,scaling,imlist,img)
        drawnode(draw,clust.right,x+ll,bottom-h2/2,scaling,imlist,img)
    else:
        # If this is an endpoint, draw a thumbnail image
        # (paste needs integer pixel coordinates)
        nodeim = Image.open(imlist[clust.id])
        nodeim.thumbnail((20,20))
        ns = nodeim.size
        img.paste(nodeim,(int(x),int(y-ns[1]//2),int(x+ns[0]),int(y+ns[1]-ns[1]//2)))

The dendrogram is then drawn like this.

hcluster.drawdendrogram(tree,imlist,jpeg='sunset.jpg')

This should give an image like the one below.


Hope this was useful. I have one final small thing about clustering I want to write, something for next week maybe.

11 comments:

  1. Hi,

    Thanks for posting your code. I was wondering if I could trouble with a quick question: if I have a tab-delimited text file with the matrix, how do I do that with your code? I'm sorry for the noob question, I'm just learning python. :)

    Thanks!

    Greg

    ReplyDelete
  2. Hi Greg

    You can do this directly using NumPy.

    >>> from numpy import *
    >>> a = loadtxt('yourfile.txt')

    ReplyDelete
  3. Howdy,
    I tried running the example you give and I receive an error "AttributeError: 'module' object has no attribute 'hcluster'" being thrown from the line "tree = hcluster.hcluster(features)".

    Any idea what might be going wrong? Thanks!
    - Fincher

    ReplyDelete
  4. @Fincher Check that the directory containing hcluster.py is in your PYTHONPATH (or in the same directory as your script). (e.g. type "env" in a linux/mac terminal to see PYTHONPATH)

    ReplyDelete
  5. Great code, but I think, I found a bug in this code.

    The bug is in get_cluster_elements function.
    The first if statement should be greater or equal
    > if clust.id>=0:
    and not just greater as it is posted.

    This is because the ids are being set by the range function from 0 (zero) to the length of the features minus 1(one).

    So the feature with id=0 is not returned

    The same problem is in drawnode function too. At the if clause again.

    ReplyDelete
  6. Hi,

    I got an error message when calling hcluster.drawdendrogram(...)

    "NameError: global name 'getheight' is not defined"
    from which module is this function?

    ReplyDelete
  7. @Anonymous: The get height and depth functions for this example would look like:

    def get_height(node):
        """ Return the height of a node. """
        if node.left==None and node.right==None:
            return 1
        else: # set height to sum of each branch
            return get_height(node.left)+get_height(node.right)


    def get_depth(node):
        """ Return the depth of a node. """
        if node.left==None and node.right==None:
            return 0
        else: # max of each child plus own distance
            return max(get_depth(node.left),get_depth(node.right))+node.distance

    Hope that helps. There is a complete rewrite of hcluster in my upcoming book. Worth a look.
    http://www.maths.lth.se/matematiklth/personal/solem/book.html

    ReplyDelete
  8. Nice tiny code !

    I'm playing with it, I noticed that "count" looks unused there...

    ReplyDelete
  9. Where should this line be placed in the specific program?
    hcluster.drawdendrogram(tree,imlist,jpeg='sunset.jpg')

    Either in hcluster.py or run in the command shell??

    ReplyDelete