Thursday, April 30, 2009

Hierarchical Clustering in Python

Continuing on the topic of clustering algorithms, here is another popular choice with a Python implementation. No need to copy paste, the all the code can be found here.

Hierarchical clustering is another simple but powerful clustering algorithm. The idea is to build a similarity tree based on pairwise distances. The algorithm starts with grouping the two closest objects (based on the distance between feature vectors) and creates an "average" node in a tree with the two objects as children. Then the next closest pair is found among the remaining objets but also including any average nodes, and so on. At each node the distance between the two children is also stored. Clusters can then be extracted by traversing this tree and stopping nodes with distance smaller some threshold.

Hierarchical clustering has several benefits. For example the tree structure can be used to visualize relationships and how clusters are related. A good feature vector will give a nice separation in the tree. Another benefit is that the tree can be used with different cluster thresholds without recomputing the tree. The drawback, however, is that one needs to choose a threshold if the actual clusters are needed.

Let's see what this looks like in code. Create a file and add the following code (partially taken and modified from the example in "Programming Collective Intelligence" by Toby Segaran (O'Reilly Media 2007, page 33).

from numpy import *

class cluster_node:
def __init__(self,vec,left=None,right=None,distance=0.0,id=None,count=1):
self.count=count #only used for weighted average

def L2dist(v1,v2):
return sqrt(sum((v1-v2)**2))

def hcluster(features,distance=L2dist):
#cluster the rows of the "features" matrix

# clusters are initially just the individual rows
clust=[cluster_node(array(features[i]),id=i) for i in range(len(features))]

while len(clust)>1:

# loop through every pair looking for the smallest distance
for i in range(len(clust)):
for j in range(i+1,len(clust)):
# distances is the cache of distance calculations
if (clust[i].id,clust[j].id) not in distances:


if d < closest:

# calculate the average of the two clusters
mergevec=[(clust[lowestpair[0]].vec[i]+clust[lowestpair[1]].vec[i])/2.0 \
for i in range(len(clust[0].vec))]

# create the new cluster

# cluster ids that weren't in the original set are negative
del clust[lowestpair[1]]
del clust[lowestpair[0]]

return clust[0]

Running hcluster() on a matrix with feature vectors as rows will create and return the cluster tree. The choice of distance measure depends on the actual feature vectors, here we used the Euclidean distance but you can create any function and use that as parameter.

To extract the clusters from the tree you need to traverse the tree from the top until a node with distance value smaller than some threshold is found. This is easiest done recursively.

def extract_clusters(clust,dist):
# extract list of sub-tree clusters from hcluster tree with distance < dist
clusters = {}
if clust.distance < dist:
# we have found a cluster subtree
return [clust]
# check the right and left branches
cl = []
cr = []
if clust.left!=None:
cl = extract_clusters(clust.left,dist=dist)
if clust.right!=None:
cr = extract_clusters(clust.right,dist=dist)
return cl+cr

This function will return a list of sub-trees containing the clusters. To get the leaf nodes that contain the object ids, traverse each sub-tree and return a list of leaves using:

def get_cluster_elements(clust):
# return ids for elements in a cluster sub-tree
# positive id means that this is a leaf
return []
# check the right and left branches
cl = []
cr = []
if clust.left!=None:
cl = get_cluster_elements(clust.left)
if clust.right!=None:
cr = get_cluster_elements(clust.right)
return cl+cr

Let's try this on a simple example to see it all in action. The file contains 100 images downloaded from Flickr using the tag "sunset" or "sunsets". For this example we will use the average color of each image as feature vector. This is crude and simple but good enough for illustrating what hierarchical clustering does. Try running the following code in a folder containing the sunset images.

import os
from PIL import Image
from numpy import *
import hcluster

#create a list of images
imlist = []
for filename in os.listdir('./'):
if os.path.splitext(filename)[1] == '.jpg':
n = len(imlist)

#extract feature vector for each image
features = zeros((n,3))
for i in range(n):
im = array([i]))
R = mean(im[:,:,0].flatten())
G = mean(im[:,:,1].flatten())
B = mean(im[:,:,2].flatten())
features[i] = array([R,G,B])

tree = hcluster.hcluster(features)

To visualize the cluster tree, one can draw a dendrogram. This often gives useful information on how good a given descriptor vector is and what is considered similar in a particular case. Add the following code (also adapted from Segaran's book).

from PIL import Image,ImageDraw

def drawdendrogram(clust,imlist,jpeg='clusters.jpg'):
# height and width

# width is fixed, so scale distances accordingly

# Create a new image with a white background'RGB',(w,h),(255,255,255))


# Draw the first node

def drawnode(draw,clust,x,y,scaling,imlist,img):
# Line length
# Vertical line from this cluster to children

# Horizontal line to left item

# Horizontal line to right item

# Call the function to draw the left and right nodes
# If this is an endpoint, draw a thumbnail image
nodeim =[])
ns = nodeim.size

The dendrogram is then drawn like this.


This should give an image like the one below.

Hope this was useful. I have one final small thing about clustering I want to write, something for next week maybe.


  1. Hi,

    Thanks for posting your code. I was wondering if I could trouble with a quick question: if I have a tab-delimited text file with the matrix, how do I do that with your code? I'm sorry for the noob question, I'm just learning python. :)



  2. Hi Greg

    You can do this directly using NumPy.

    >>> from numpy import *
    >>> a = loadtxt('yourfile.txt')

  3. Howdy,
    I tried running the example you give and I receive an error "AttributeError: 'module' object has no attribute 'hcluster'" being thrown from the line "tree = hcluster.hcluster(features)

    Any idea what might be going wrong? Thanks!
    - Fincher

  4. @Finchler Check that the directory containing is in your PYTHONPATH (or in the same directory as your script). (e.g. type "env" in a linux/mac terminal to see PYTHONPATH)

  5. Great code, but I think, I found a bug in this code.

    The bug is in get_cluster_elements function.
    The first if statement should be greater or equal
    > if>=0:
    and not just greater as it is posted.

    This is because the ids are being set by the range function from 0 (zero) to the length of the features minus 1(one).

    So the feature with id=0 is not returned

    The same problem is in drawnode function too. At the if clause again.

  6. Hi,

    I got an error message when calling hcluster.drawdendrogram(...)

    "NameError: global name 'getheight' is not defined"
    from which module is this function?

  7. @Anonymous: The get height and width functions for this example would look like:

    def get_height(node):
    """ Return the height of a node. """

    if node.left==None and node.right==None:
    return 1
    else: # set height to sum of each branch
    return get_height(node.left)+get_height(node.right)

    def get_depth(node):
    """ Return the depth of a node. """

    if node.left==None and node.right==None:
    return 0
    else: # max of each child plus own distance
    return max(get_depth(node.left),get_depth(node.right))+node.distance

    Hope that helps. There is a complete rewrite of hcluster in my upcoming book. Worth a look.

  8. Nice tiny code !

    I'm playing with it, I noticed that "count" looks unused there...

  9. Where should this line be placed in the specific program?

    Either in or run in the command shell??

  10. Hi,
    thanks for explanation about h-cluster.
    i just run that program, and I got errors like this :
    File "D:\latihan\sunsets\", line 25, in
    File "D:\latihan\sunsets\", line 154, in drawdendrogram
    File "D:\latihan\sunsets\", line 175, in drawnode
    File "D:\latihan\sunsets\", line 175, in drawnode
    File "D:\latihan\sunsets\", line 175, in drawnode
    File "D:\latihan\sunsets\", line 175, in drawnode
    File "D:\latihan\sunsets\", line 182, in drawnode
    File "C:\Python27\lib\site-packages\PIL\", line 1103, in paste, box)
    TypeError: integer argument expected, got float
    How can I solce it ?

    1. Did you find a solution for this?

    2. For me also getting same error. anyone know answer please update

    3. I was able to get it to draw, You need to change line 182 from:

      to :

      Notice the extra int() on all the parameters.
      I am using python 3, so I did: sudo pip3 install pynum, pillow ... and then changed the function. Using python 3, you also need to update the print statements by encapsulating them with ()

      (pillow is the replacement for PIL, works the same)

  11. Hi,
    I had tried to cluster a text dokument (.txt file) by keywords occurrences,
    but It doesn't work. didn't worked for text input ?

  12. cool job! i was searching for something like this, where i could use some user-defined distances. I was hoping that scipy.hierarchi could help, but in the end i found out that some functions there were compiled with some other languages instead of python, so i gave up and turned to your code, which in the end worked well for my own purpose. I didn't get your drawing part, but I changed the output of the while loop into a matrix, which can fit into scipy.hierarchical.dendrogram, the visualization is also great. Thanks anyway!

  13. And one more word, I was actually confused by you saying "weighted average", so I checked some papers about hierarchical clustering, and by Lance-Williams formula, I think the way you compute "mergevec" belongs to "Median" method, it is kind of "weighted" compared to another method called "Centroid". I got confused because there is also another pair of methods defined by the same formula, called "Unweighted Pair Group Method using
    Arithmetic mean (UPMGA)" and "Weighted Pair Group Method using
    Arithmetic mean (WPMGA)". The way to compute "mergevec" by these two ways would take average over all the members in the cluster, instead of divided by 2.