Hierarchical clustering is another simple but powerful clustering algorithm. The idea is to build a similarity tree based on pairwise distances. The algorithm starts with grouping the two closest objects (based on the distance between feature vectors) and creates an "average" node in a tree with the two objects as children. Then the next closest pair is found among the remaining objets but also including any average nodes, and so on. At each node the distance between the two children is also stored. Clusters can then be extracted by traversing this tree and stopping nodes with distance smaller some threshold.
Hierarchical clustering has several benefits. For example the tree structure can be used to visualize relationships and how clusters are related. A good feature vector will give a nice separation in the tree. Another benefit is that the tree can be used with different cluster thresholds without recomputing the tree. The drawback, however, is that one needs to choose a threshold if the actual clusters are needed.
Let's see what this looks like in code. Create a file hcluster.py and add the following code (partially taken and modified from the example in "Programming Collective Intelligence" by Toby Segaran (O'Reilly Media 2007, page 33).
from numpy import *
class cluster_node:
def __init__(self,vec,left=None,right=None,distance=0.0,id=None,count=1):
self.left=left
self.right=right
self.vec=vec
self.id=id
self.distance=distance
self.count=count #only used for weighted average
def L2dist(v1,v2):
return sqrt(sum((v1-v2)**2))
def hcluster(features,distance=L2dist):
#cluster the rows of the "features" matrix
distances={}
currentclustid=-1
# clusters are initially just the individual rows
clust=[cluster_node(array(features[i]),id=i) for i in range(len(features))]
while len(clust)>1:
lowestpair=(0,1)
closest=distance(clust[0].vec,clust[1].vec)
# loop through every pair looking for the smallest distance
for i in range(len(clust)):
for j in range(i+1,len(clust)):
# distances is the cache of distance calculations
if (clust[i].id,clust[j].id) not in distances:
distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)
d=distances[(clust[i].id,clust[j].id)]
if d < closest:
closest=d
lowestpair=(i,j)
# calculate the average of the two clusters
mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 \
for i in range(len(clust[0].vec))]
# create the new cluster
newcluster=cluster_node(array(mergevec),left=clust[lowestpair[0]],
right=clust[lowestpair[1]],
distance=closest,id=currentclustid)
# cluster ids that weren't in the original set are negative
currentclustid-=1
del clust[lowestpair[1]]
del clust[lowestpair[0]]
clust.append(newcluster)
return clust[0]
Running hcluster() on a matrix with feature vectors as rows will create and return the cluster tree. The choice of distance measure depends on the actual feature vectors, here we used the Euclidean distance but you can create any function and use that as parameter.
To extract the clusters from the tree you need to traverse the tree from the top until a node with distance value smaller than some threshold is found. This is easiest done recursively.
def extract_clusters(clust,dist):
# extract list of sub-tree clusters from hcluster tree with distance < dist
clusters = {}
if clust.distance < dist:
# we have found a cluster subtree
return [clust]
else:
# check the right and left branches
cl = []
cr = []
if clust.left!=None:
cl = extract_clusters(clust.left,dist=dist)
if clust.right!=None:
cr = extract_clusters(clust.right,dist=dist)
return cl+cr
This function will return a list of sub-trees containing the clusters. To get the leaf nodes that contain the object ids, traverse each sub-tree and return a list of leaves using:
def get_cluster_elements(clust):
# return ids for elements in a cluster sub-tree
if clust.id>0:
# positive id means that this is a leaf
return [clust.id]
else:
# check the right and left branches
cl = []
cr = []
if clust.left!=None:
cl = get_cluster_elements(clust.left)
if clust.right!=None:
cr = get_cluster_elements(clust.right)
return cl+cr
Let's try this on a simple example to see it all in action. The file sunsets.zip contains 100 images downloaded from Flickr using the tag "sunset" or "sunsets". For this example we will use the average color of each image as feature vector. This is crude and simple but good enough for illustrating what hierarchical clustering does. Try running the following code in a folder containing the sunset images.
import os
from PIL import Image
from numpy import *
import hcluster
#create a list of images
imlist = []
for filename in os.listdir('./'):
if os.path.splitext(filename)[1] == '.jpg':
imlist.append(filename)
n = len(imlist)
#extract feature vector for each image
features = zeros((n,3))
for i in range(n):
im = array(Image.open(imlist[i]))
R = mean(im[:,:,0].flatten())
G = mean(im[:,:,1].flatten())
B = mean(im[:,:,2].flatten())
features[i] = array([R,G,B])
tree = hcluster.hcluster(features)
To visualize the cluster tree, one can draw a dendrogram. This often gives useful information on how good a given descriptor vector is and what is considered similar in a particular case. Add the following code (also adapted from Segaran's book).
from PIL import Image,ImageDraw
def drawdendrogram(clust,imlist,jpeg='clusters.jpg'):
# height and width
h=getheight(clust)*20
w=1200
depth=getdepth(clust)
# width is fixed, so scale distances accordingly
scaling=float(w-150)/depth
# Create a new image with a white background
img=Image.new('RGB',(w,h),(255,255,255))
draw=ImageDraw.Draw(img)
draw.line((0,h/2,10,h/2),fill=(255,0,0))
# Draw the first node
drawnode(draw,clust,10,(h/2),scaling,imlist,img)
img.save(jpeg)
def drawnode(draw,clust,x,y,scaling,imlist,img):
if clust.id<0:
h1=getheight(clust.left)*20
h2=getheight(clust.right)*20
top=y-(h1+h2)/2
bottom=y+(h1+h2)/2
# Line length
ll=clust.distance*scaling
# Vertical line from this cluster to children
draw.line((x,top+h1/2,x,bottom-h2/2),fill=(255,0,0))
# Horizontal line to left item
draw.line((x,top+h1/2,x+ll,top+h1/2),fill=(255,0,0))
# Horizontal line to right item
draw.line((x,bottom-h2/2,x+ll,bottom-h2/2),fill=(255,0,0))
# Call the function to draw the left and right nodes
drawnode(draw,clust.left,x+ll,top+h1/2,scaling,imlist,img)
drawnode(draw,clust.right,x+ll,bottom-h2/2,scaling,imlist,img)
else:
# If this is an endpoint, draw a thumbnail image
nodeim = Image.open(imlist[clust.id])
nodeim.thumbnail((20,20))
ns = nodeim.size
img.paste(nodeim,(x,y-ns[1]//2,x+ns[0],y+ns[1]-ns[1]//2))
The dendrogram is then drawn like this.
hcluster.drawdendrogram(tree,imlist,jpeg='sunset.jpg')
This should give an image like the one below.

Hope this was useful. I have one final small thing about clustering I want to write, something for next week maybe.
Hi,
ReplyDeleteThanks for posting your code. I was wondering if I could trouble with a quick question: if I have a tab-delimited text file with the matrix, how do I do that with your code? I'm sorry for the noob question, I'm just learning python. :)
Thanks!
Greg
Hi Greg
ReplyDeleteYou can do this directly using NumPy.
>>> from numpy import *
>>> a = loadtxt('yourfile.txt')
Howdy,
ReplyDeleteI tried running the example you give and I receive an error "AttributeError: 'module' object has no attribute 'hcluster'" being thrown from the line "tree = hcluster.hcluster(features)
Any idea what might be going wrong? Thanks!
- Fincher
@Finchler Check that the directory containing hcluster.py is in your PYTHONPATH (or in the same directory as your script). (e.g. type "env" in a linux/mac terminal to see PYTHONPATH)
ReplyDeleteAh, got it. Thanks!
ReplyDeleteGreat code, but I think, I found a bug in this code.
ReplyDeleteThe bug is in get_cluster_elements function.
The first if statement should be greater or equal
> if clust.id>=0:
and not just greater as it is posted.
This is because the ids are being set by the range function from 0 (zero) to the length of the features minus 1(one).
So the feature with id=0 is not returned
The same problem is in drawnode function too. At the if clause again.
@Anonymous: thanks.
ReplyDeleteHi,
ReplyDeleteI got an error message when calling hcluster.drawdendrogram(...)
"NameError: global name 'getheight' is not defined"
from which module is this function?
@Anonymous: The get height and width functions for this example would look like:
ReplyDeletedef get_height(node):
""" Return the height of a node. """
if node.left==None and node.right==None:
return 1
else: # set height to sum of each branch
return get_height(node.left)+get_height(node.right)
def get_depth(node):
""" Return the depth of a node. """
if node.left==None and node.right==None:
return 0
else: # max of each child plus own distance
return max(get_depth(node.left),get_depth(node.right))+node.distance
Hope that helps. There is a complete rewrite of hcluster in my upcoming book. Worth a look.
http://www.maths.lth.se/matematiklth/personal/solem/book.html
Nice tiny code !
ReplyDeleteI'm playing with it, I noticed that "count" looks unused there...