Hierarchical clustering is another simple but powerful clustering algorithm. The idea is to build a similarity tree based on pairwise distances. The algorithm starts with grouping the two closest objects (based on the distance between feature vectors) and creates an "average" node in a tree with the two objects as children. Then the next closest pair is found among the remaining objets but also including any average nodes, and so on. At each node the distance between the two children is also stored. Clusters can then be extracted by traversing this tree and stopping nodes with distance smaller some threshold.

Hierarchical clustering has several benefits. For example the tree structure can be used to visualize relationships and how clusters are related. A good feature vector will give a nice separation in the tree. Another benefit is that the tree can be used with different cluster thresholds without recomputing the tree. The drawback, however, is that one needs to choose a threshold if the actual clusters are needed.

Let's see what this looks like in code. Create a file hcluster.py and add the following code (partially taken and modified from the example in "Programming Collective Intelligence" by Toby Segaran (O'Reilly Media 2007, page 33).

from numpy import *

class cluster_node:

def __init__(self,vec,left=None,right=None,distance=0.0,id=None,count=1):

self.left=left

self.right=right

self.vec=vec

self.id=id

self.distance=distance

self.count=count #only used for weighted average

def L2dist(v1,v2):

return sqrt(sum((v1-v2)**2))

def hcluster(features,distance=L2dist):

#cluster the rows of the "features" matrix

distances={}

currentclustid=-1

# clusters are initially just the individual rows

clust=[cluster_node(array(features[i]),id=i) for i in range(len(features))]

while len(clust)>1:

lowestpair=(0,1)

closest=distance(clust[0].vec,clust[1].vec)

# loop through every pair looking for the smallest distance

for i in range(len(clust)):

for j in range(i+1,len(clust)):

# distances is the cache of distance calculations

if (clust[i].id,clust[j].id) not in distances:

distances[(clust[i].id,clust[j].id)]=distance(clust[i].vec,clust[j].vec)

d=distances[(clust[i].id,clust[j].id)]

if d < closest:

closest=d

lowestpair=(i,j)

# calculate the average of the two clusters

mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 \

for i in range(len(clust[0].vec))]

# create the new cluster

newcluster=cluster_node(array(mergevec),left=clust[lowestpair[0]],

right=clust[lowestpair[1]],

distance=closest,id=currentclustid)

# cluster ids that weren't in the original set are negative

currentclustid-=1

del clust[lowestpair[1]]

del clust[lowestpair[0]]

clust.append(newcluster)

return clust[0]

Running hcluster() on a matrix with feature vectors as rows will create and return the cluster tree. The choice of distance measure depends on the actual feature vectors, here we used the Euclidean distance but you can create any function and use that as parameter.

To extract the clusters from the tree you need to traverse the tree from the top until a node with distance value smaller than some threshold is found. This is easiest done recursively.

def extract_clusters(clust,dist):

# extract list of sub-tree clusters from hcluster tree with distance < dist

clusters = {}

if clust.distance < dist:

# we have found a cluster subtree

return [clust]

else:

# check the right and left branches

cl = []

cr = []

if clust.left!=None:

cl = extract_clusters(clust.left,dist=dist)

if clust.right!=None:

cr = extract_clusters(clust.right,dist=dist)

return cl+cr

This function will return a list of sub-trees containing the clusters. To get the leaf nodes that contain the object ids, traverse each sub-tree and return a list of leaves using:

def get_cluster_elements(clust):

# return ids for elements in a cluster sub-tree

if clust.id>0:

# positive id means that this is a leaf

return [clust.id]

else:

# check the right and left branches

cl = []

cr = []

if clust.left!=None:

cl = get_cluster_elements(clust.left)

if clust.right!=None:

cr = get_cluster_elements(clust.right)

return cl+cr

Let's try this on a simple example to see it all in action. The file sunsets.zip contains 100 images downloaded from Flickr using the tag "sunset" or "sunsets". For this example we will use the average color of each image as feature vector. This is crude and simple but good enough for illustrating what hierarchical clustering does. Try running the following code in a folder containing the sunset images.

import os

from PIL import Image

from numpy import *

import hcluster

#create a list of images

imlist = []

for filename in os.listdir('./'):

if os.path.splitext(filename)[1] == '.jpg':

imlist.append(filename)

n = len(imlist)

#extract feature vector for each image

features = zeros((n,3))

for i in range(n):

im = array(Image.open(imlist[i]))

R = mean(im[:,:,0].flatten())

G = mean(im[:,:,1].flatten())

B = mean(im[:,:,2].flatten())

features[i] = array([R,G,B])

tree = hcluster.hcluster(features)

To visualize the cluster tree, one can draw a dendrogram. This often gives useful information on how good a given descriptor vector is and what is considered similar in a particular case. Add the following code (also adapted from Segaran's book).

from PIL import Image,ImageDraw

def drawdendrogram(clust,imlist,jpeg='clusters.jpg'):

# height and width

h=getheight(clust)*20

w=1200

depth=getdepth(clust)

# width is fixed, so scale distances accordingly

scaling=float(w-150)/depth

# Create a new image with a white background

img=Image.new('RGB',(w,h),(255,255,255))

draw=ImageDraw.Draw(img)

draw.line((0,h/2,10,h/2),fill=(255,0,0))

# Draw the first node

drawnode(draw,clust,10,(h/2),scaling,imlist,img)

img.save(jpeg)

def drawnode(draw,clust,x,y,scaling,imlist,img):

if clust.id<0:

h1=getheight(clust.left)*20

h2=getheight(clust.right)*20

top=y-(h1+h2)/2

bottom=y+(h1+h2)/2

# Line length

ll=clust.distance*scaling

# Vertical line from this cluster to children

draw.line((x,top+h1/2,x,bottom-h2/2),fill=(255,0,0))

# Horizontal line to left item

draw.line((x,top+h1/2,x+ll,top+h1/2),fill=(255,0,0))

# Horizontal line to right item

draw.line((x,bottom-h2/2,x+ll,bottom-h2/2),fill=(255,0,0))

# Call the function to draw the left and right nodes

drawnode(draw,clust.left,x+ll,top+h1/2,scaling,imlist,img)

drawnode(draw,clust.right,x+ll,bottom-h2/2,scaling,imlist,img)

else:

# If this is an endpoint, draw a thumbnail image

nodeim = Image.open(imlist[clust.id])

nodeim.thumbnail((20,20))

ns = nodeim.size

img.paste(nodeim,(x,y-ns[1]//2,x+ns[0],y+ns[1]-ns[1]//2))

The dendrogram is then drawn like this.

hcluster.drawdendrogram(tree,imlist,jpeg='sunset.jpg')

This should give an image like the one below.

Hope this was useful. I have one final small thing about clustering I want to write, something for next week maybe.

Hi,

ReplyDeleteThanks for posting your code. I was wondering if I could trouble with a quick question: if I have a tab-delimited text file with the matrix, how do I do that with your code? I'm sorry for the noob question, I'm just learning python. :)

Thanks!

Greg

Hi Greg

ReplyDeleteYou can do this directly using NumPy.

>>> from numpy import *

>>> a = loadtxt('yourfile.txt')

Howdy,

ReplyDeleteI tried running the example you give and I receive an error "AttributeError: 'module' object has no attribute 'hcluster'" being thrown from the line "tree = hcluster.hcluster(features)

Any idea what might be going wrong? Thanks!

- Fincher

@Finchler Check that the directory containing hcluster.py is in your PYTHONPATH (or in the same directory as your script). (e.g. type "env" in a linux/mac terminal to see PYTHONPATH)

ReplyDeleteAh, got it. Thanks!

ReplyDeleteGreat code, but I think, I found a bug in this code.

ReplyDeleteThe bug is in get_cluster_elements function.

The first if statement should be greater or equal

> if clust.id>=0:

and not just greater as it is posted.

This is because the ids are being set by the range function from 0 (zero) to the length of the features minus 1(one).

So the feature with id=0 is not returned

The same problem is in drawnode function too. At the if clause again.

@Anonymous: thanks.

ReplyDeleteHi,

ReplyDeleteI got an error message when calling hcluster.drawdendrogram(...)

"NameError: global name 'getheight' is not defined"

from which module is this function?

@Anonymous: The get height and width functions for this example would look like:

ReplyDeletedef get_height(node):

""" Return the height of a node. """

if node.left==None and node.right==None:

return 1

else: # set height to sum of each branch

return get_height(node.left)+get_height(node.right)

def get_depth(node):

""" Return the depth of a node. """

if node.left==None and node.right==None:

return 0

else: # max of each child plus own distance

return max(get_depth(node.left),get_depth(node.right))+node.distance

Hope that helps. There is a complete rewrite of hcluster in my upcoming book. Worth a look.

http://www.maths.lth.se/matematiklth/personal/solem/book.html

Nice tiny code !

ReplyDeleteI'm playing with it, I noticed that "count" looks unused there...

Where should this line be placed in the specific program?

ReplyDeletehcluster.drawdendrogram(tree,imlist,jpeg='sunset.jpg')

Either in hcluster.py or run in the command shell??

Hi,

ReplyDeletethanks for explanation about h-cluster.

i just run that program, and I got errors like this :

File "D:\latihan\sunsets\flickrtest.py", line 25, in

hcluster.drawdendrogram(tree,imlist,jpeg='sunset.jpg')

File "D:\latihan\sunsets\hcluster.py", line 154, in drawdendrogram

drawnode(draw,clust,10,(h/2),scaling,imlist,img)

File "D:\latihan\sunsets\hcluster.py", line 175, in drawnode

drawnode(draw,clust.left,x+ll,top+h1/2,scaling,imlist,img)

File "D:\latihan\sunsets\hcluster.py", line 175, in drawnode

drawnode(draw,clust.left,x+ll,top+h1/2,scaling,imlist,img)

File "D:\latihan\sunsets\hcluster.py", line 175, in drawnode

drawnode(draw,clust.left,x+ll,top+h1/2,scaling,imlist,img)

File "D:\latihan\sunsets\hcluster.py", line 175, in drawnode

drawnode(draw,clust.left,x+ll,top+h1/2,scaling,imlist,img)

File "D:\latihan\sunsets\hcluster.py", line 182, in drawnode

img.paste(nodeim,(x,y-ns[1]//2,x+ns[0],y+ns[1]-ns[1]//2))

File "C:\Python27\lib\site-packages\PIL\Image.py", line 1103, in paste

self.im.paste(im, box)

TypeError: integer argument expected, got float

.

How can I solce it ?

Hi,

ReplyDeleteI had tried to cluster a text dokument (.txt file) by keywords occurrences,

but It doesn't work. didn't hcluster.py worked for text input ?

cool job! i was searching for something like this, where i could use some user-defined distances. I was hoping that scipy.hierarchi could help, but in the end i found out that some functions there were compiled with some other languages instead of python, so i gave up and turned to your code, which in the end worked well for my own purpose. I didn't get your drawing part, but I changed the output of the while loop into a matrix, which can fit into scipy.hierarchical.dendrogram, the visualization is also great. Thanks anyway!

ReplyDeleteAnd one more word, I was actually confused by you saying "weighted average", so I checked some papers about hierarchical clustering, and by Lance-Williams formula, I think the way you compute "mergevec" belongs to "Median" method, it is kind of "weighted" compared to another method called "Centroid". I got confused because there is also another pair of methods defined by the same formula, called "Unweighted Pair Group Method using

ReplyDeleteArithmetic mean (UPMGA)" and "Weighted Pair Group Method using

Arithmetic mean (WPMGA)". The way to compute "mergevec" by these two ways would take average over all the members in the cluster, instead of divided by 2.