The problem of categorization presents itself in many aspects of life. When marketers talk of “market segmentation” they essentially mean: “what categories of customer are there?” When investors talk of “asset classes” they essentially mean: “what categories of investment are there?” In the past, when facing the problem of categorization, I would typically look at the universe of things I needed to categorize and make fairly arbitrary judgements to put them into categories. Cluster analysis rationalizes the process by analyzing the data with an open mind: it is an unsupervised learning process.

So what is a cluster? It is a group of observations that are similar to each other and dissimilar to observations in other clusters. Let’s start with some basics …

When I first started looking at clustering it took me a while to realize that there are really two main issues:

  1. How far apart are any two observations?
  2. Given (1), how do we put the observations into groups?
I found it very important to keep straight in my mind whether the topic at hand was distance between observations or the similarity / dissimilarity of the clusters of observations. A trick I used was to try to consistently refer to the former as “distance” and the latter as “linkage”. For some reason it helped!

Observations and Distance Measures

An observation is a vector of values, not necessarily all of the same type, associated with the object to be clustered. The values might be of the following types:

  • Numerical: e.g. 4′8″, 6′4″, 5′10″ – if we placed the values on a scale we could visualize the distance between them.
  • Ordinal: e.g. 1st, 2nd, 3rd – the ordering matters (e.g. 1st is closer to 2nd than to 3rd, but we don’t know how much closer).
  • Binary: True / False – the feature is either there or it is not.
  • Categorical: e.g. red, blue, green – there is no ordering to the categories.
Note that datasets may include different data types, complicating the distance calculation.

I plan on putting up a short post on distance measures that can be applied in each of the above cases. For the time being, suffice to say that we start out with a data frame of observations (each observation on a row, the values for each factor in columns) and end up with a triangular distance matrix showing the distances between every possible pair of observations.
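To make that pipeline concrete, here is a minimal sketch using NumPy and SciPy (the observations are made up for illustration): `pdist` produces the condensed triangular distance matrix described above, and `squareform` expands it into the full symmetric form when that is more convenient.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform

# Five hypothetical observations, each with two numeric features
observations = np.array([
    [1.0, 2.0],
    [1.5, 1.8],
    [5.0, 8.0],
    [8.0, 8.0],
    [1.0, 0.6],
])

# pdist returns the condensed (triangular) distance matrix:
# one entry per pair of observations, so 5*4/2 = 10 entries here
condensed = pdist(observations, metric="euclidean")

# squareform expands it to the full symmetric matrix if needed
full = squareform(condensed)
print(full.shape)  # (5, 5)
```

Other metrics (e.g. `"cityblock"`, `"hamming"` for binary features) can be passed to `pdist` in the same way.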

Forming Clusters

For the most part, forming the data into clusters is done by following some well-defined procedure. There are hundreds of them in the literature. Following is a conceptual overview of the main methods. Details may follow in future posts.
  • Partitioning, e.g. k-means: The entire dataset is partitioned into a set of k clusters. An iterative process is used to discover a stable clustering that minimizes some objective function.
  • Hierarchical: Either start with one big cluster and split it repeatedly, using an objective function to determine each split (divisive), or start with every observation in a cluster of size one and repeatedly join the two closest together until only a single cluster remains (agglomerative).
  • Density-Based: k seed observations are selected. For each, the nearest neighbors are attached to form the clusters, rather like stepping from stone to stone when crossing a stream. The newest addition to a cluster may be closer on average to some other cluster, but its closest neighbors are part of this one. This allows for clusters that are not convex (see the illustration below).
  • Probabilistic: A model of k clusters, each with an unknown mean and variance, is proposed. k observations are randomly selected as the initial means. The model is updated using the data to better estimate the means and variances, and the process is repeated until the maximum likelihood estimate of the mean and variance of each cluster is reached.
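As an illustrative sketch of the partitioning approach, here is k-means via scikit-learn on invented data (the blob locations and the choice of k = 2 are assumptions for the example, not anything special):

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Two loose, well-separated blobs of 20 points each
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=5.0, scale=0.5, size=(20, 2)),
])

# n_init restarts the iterative process from several random
# initializations and keeps the best result
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(data)

print(km.labels_)   # cluster index (0 or 1) for each observation
print(km.inertia_)  # the objective: within-cluster sum of squares
```

Note that k must be supplied up front; choosing it well is part of the art (the quality measures below can help).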

Representations of Clusters

[Figure: dendrogram showing cluster analysis results]

Various ways of representing clusters have been devised.

For hierarchical clustering, where there are not too many observations, dendrograms such as that on the right are used. The basic intuition is that items 1 and 2 were the closest together, with a distance of 20 units. The linkage between the cluster formed by 1 & 2 and the cluster formed by 3, 4, & 5 is around 58 units.

The dendrogram doesn’t answer the question “How many clusters are there?”. Fortunately there are ways for us to estimate which of the branches in such a diagram are reliable and which are not.
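For concreteness, here is a minimal sketch of agglomerative clustering with SciPy on invented data: `linkage` builds the tree that a dendrogram would draw, and `fcluster` cuts that tree into a chosen number of flat clusters.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster
from scipy.spatial.distance import pdist

observations = np.array([
    [1.0, 1.0], [1.2, 1.1],              # a tight pair
    [5.0, 5.0], [5.1, 4.9], [5.3, 5.2],  # a looser triple
])

# linkage consumes the condensed distance matrix; "average"
# linkage merges the two clusters with the smallest mean
# pairwise distance at each step
Z = linkage(pdist(observations), method="average")

# Cut the tree into 2 flat clusters: the pair vs. the triple
labels = fcluster(Z, t=2, criterion="maxclust")
print(labels)

# scipy.cluster.hierarchy.dendrogram(Z) draws the tree itself
# when matplotlib is available
```

Each row of `Z` records one merge: the two clusters joined and the distance (linkage) at which they were joined, which is exactly the height shown in the dendrogram.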

Quality Measures

Following are some basic measures of the quality of clusters produced by a cluster analysis (there are many more, and new ones are being invented all the time):
  • For many clustering approaches there are ways of using bootstrap resampling to estimate p-values for the clusters. This provides some idea of how robust the clusters are.
  • Cluster diameter, D, is the distance between the two observations within the cluster that are furthest apart.
  • Cluster separation, S, is the smallest distance between any observation within the cluster and any observation not in the cluster.
  • Cluster isolation, L, is based on S, D, and the distances between every possible pairing of the members of two clusters.
  • There are coefficients that measure the ratio of the linkage between a given observation and its cluster to the linkage of the most distant clusters. There is an agglomerative version and a divisive version.
  • Silhouette width is, for each observation, a measure of how similar it is to the members of its own cluster versus how dissimilar it is from the members of the next nearest cluster.
In a future post I will define some of these measures, and others, in more detail.
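As a sketch of the silhouette idea, scikit-learn can compute the per-observation widths and their mean directly (the data and the choice of k = 2 are made up for illustration):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples, silhouette_score

rng = np.random.default_rng(1)
# Two tight, well-separated blobs of 15 points each
data = np.vstack([
    rng.normal(0.0, 0.3, size=(15, 2)),
    rng.normal(4.0, 0.3, size=(15, 2)),
])
labels = KMeans(n_clusters=2, n_init=10, random_state=1).fit_predict(data)

# One value in [-1, 1] per observation: near 1 means firmly
# inside its own cluster, near 0 means on a boundary, negative
# means probably assigned to the wrong cluster
per_obs = silhouette_samples(data, labels)

# The mean over all observations; a common way to compare
# clusterings with different k
overall = silhouette_score(data, labels)
print(overall)
```

Re-running this for several values of k and keeping the k with the highest mean silhouette is a common, if rough, way of answering the “how many clusters?” question.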

Things to Remember

[Figure: diagram of non-spherical clusters]

Pictures of Clusters

If there is a nice picture of some clusters, such as that at the right, it is most likely not a realistic representation unless the clustering was truly done in two dimensions. Sometimes the axes mean something; sometimes the space in which the clusters are drawn is purely conceptual. Sometimes a diagram will have been produced after a dimensionality-reduction process (akin to principal component analysis) has been applied, so only the two most significant dimensions are presented. This can give rise to what look like overlapping clusters when in fact you are missing other dimensions on which the separation is taking place. The easiest analogy is to imagine looking down on a scene from above: perhaps it looks like the bus is going to crash into the airplane, but they are actually separated by vertical space.
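A minimal sketch of that dimensionality-reduction step, using PCA from scikit-learn on invented data, shows how a plot ends up with only two axes:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
data = rng.normal(size=(30, 5))  # 30 observations, 5 features

# Project onto the two directions of greatest variance
coords_2d = PCA(n_components=2).fit_transform(data)
print(coords_2d.shape)  # (30, 2)

# A scatter plot of coords_2d shows only these two directions;
# any separation that exists along the three dropped components
# is invisible, which is how "overlapping" clusters can appear.
```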

Missing Observations Can Change Clustering

The presence or absence of a particular observation can dramatically change the resulting clustering, so it is not a good idea to work with partial datasets. This is particularly the case with hierarchical methods where, at each step, the cluster medians change as observations are merged in. Always include as many observations as is practical.

Computing Resources

Some clustering methods do not scale well, and much research has been done to make these methods scalable without reducing their effectiveness. Fortunately, in anything I have done this has not been an issue. Apparently large-scale text analysis is one area where it is.
