The problem of categorization presents itself in many aspects of life. When marketers talk of “market segmentation” they essentially mean: “What categories of customer are there?” When investors talk of “asset classes” they essentially mean: “What categories of investment are there?” In the past, when facing the problem of categorization, I would typically look at the universe of things I needed to categorize and make fairly arbitrary judgements to put them into categories. Cluster analysis rationalizes the process by analyzing the data with an open mind: it is an unsupervised learning process.
So what is a cluster? It is a group of observations that are similar to each other and dissimilar to observations in other clusters. Let’s start with some basics …
When I first started looking at clustering it took me a while to realize that there are really two main issues:
1. How far apart are any two observations?
2. Given the answer to (1), how do we put the observations into groups?
Observations and Distance Measures
An observation is a vector of values, not necessarily all of the same type, associated with the object to be clustered. The values might be of the following types:
- Numerical: e.g. 4’8″, 6’4″, 5’10” – if we placed values on a scale we could visualize distance.
- Ordinal: e.g. 1st, 2nd, 3rd – the ordering matters (e.g. 1st is closer to 2nd than to 3rd, but we don’t know anything about how much closer).
- Binary: True / False – the feature is either there or it is not.
- Categorical: e.g. red, blue, green – there is no ordering to the categories.
I plan on putting up a short post on distance measures that can be applied in each of the above cases. For the time being, suffice to say that we start out with a data frame of observations (each observation on a row, the values for each factor in columns) and end up with a triangular distance matrix showing the distances between every possible pair of observations.
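As a concrete sketch of the end point, here is how purely numeric observations could be turned into a triangular distance matrix using Euclidean distance; the observations themselves are invented for illustration, and real data would usually need the mixed-type distance measures mentioned above:

```python
from math import sqrt

def euclidean(a, b):
    """Straight-line distance between two numeric observations."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Invented observations: each row is one observation, columns are numeric factors.
observations = [(1.0, 2.0), (1.5, 1.8), (8.0, 8.0)]

# Lower-triangular distance matrix: distances[i][j] holds the distance
# between observations i and j, for j < i.
distances = [[euclidean(observations[i], observations[j]) for j in range(i)]
             for i in range(len(observations))]
```

Storing only the lower triangle works because distance is symmetric and the distance from an observation to itself is zero.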
Clustering Algorithms
- Partitioning, e.g. k-means: the entire data set is partitioned into k clusters. An iterative process is used to discover a stable clustering that minimizes some objective function.
- Hierarchical: either start with one big cluster and split it repeatedly, using an objective function to determine each split, or start with every observation in a cluster of size one and repeatedly join the two closest clusters until only a single cluster remains.
- Density-Based: k seed observations are selected. For each, the nearest neighbors are attached to form the clusters, rather like stepping from stone to stone when crossing a stream. The newest addition to a cluster may be closer on average to some other cluster, but its closest neighbors are part of this cluster. This allows for clusters that are not convex (see illustration below).
- Probabilistic: a model of k clusters, each with an unknown mean and variance, is proposed. k observations are randomly selected as the initial means. The model is then updated using the data to better estimate the means and variances, and the process is repeated until the maximum-likelihood estimate of each cluster's mean and variance is reached.
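To make the partitioning idea concrete, here is a minimal, dependency-free sketch of k-means. Real implementations use random (or smarter) seeding and a proper convergence test; for a deterministic illustration this version simply seeds with the first k points, and the toy data is invented:

```python
from math import dist  # Euclidean distance between two points (Python 3.8+)

def k_means(points, k, iterations=20):
    """Partition `points` into k clusters by repeatedly assigning each
    point to its nearest centroid and recomputing the centroids."""
    centroids = [list(p) for p in points[:k]]  # simple deterministic seeding
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: dist(p, centroids[i]))
            clusters[nearest].append(p)
        for i, members in enumerate(clusters):
            if members:  # new centroid = mean of the cluster's members
                centroids[i] = [sum(c) / len(members) for c in zip(*members)]
    return clusters, centroids

# Invented toy data: two well-separated groups.
points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
clusters, centroids = k_means(points, 2)
```

Even on this toy data the iterative character shows: the first assignment pass splits the points badly, and the recomputed centroids pull the partition into the two obvious groups on the next pass.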
Representations of Clusters
Various ways of representing clusters have been devised.
For hierarchical clustering, where there are not too many observations, dendrograms are used such as that on the right. The basic intuition is that items 1 and 2 were the closest together with a distance of 20 units. The linkage between the cluster formed by 1 & 2 and the cluster formed by 3, 4, & 5 is around 58 units.
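The bottom-up (agglomerative) version of this can be sketched in a few lines: repeatedly find the two closest clusters, merge them, and record the distance at which the merge happened; those recorded distances are the heights drawn on the dendrogram. This sketch uses single linkage (closest pair of members), one of several linkage choices, and the pairwise distances are hypothetical, chosen to echo the 20- and 58-unit figures mentioned above:

```python
def single_linkage_merges(pairwise, labels):
    """Agglomerative clustering by single linkage: repeatedly merge the
    two closest clusters, recording the distance (dendrogram height) at
    which each merge happens. `pairwise[(a, b)]`, with a < b, gives the
    distance between observations a and b."""
    clusters = [frozenset([label]) for label in labels]

    def linkage(c1, c2):
        # Single linkage: distance between the closest pair of members.
        return min(pairwise[tuple(sorted((x, y)))] for x in c1 for y in c2)

    merges = []
    while len(clusters) > 1:
        i, j = min(((a, b) for a in range(len(clusters))
                    for b in range(a + 1, len(clusters))),
                   key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        merges.append((set(clusters[i]), set(clusters[j]),
                       linkage(clusters[i], clusters[j])))
        clusters = ([c for k, c in enumerate(clusters) if k not in (i, j)]
                    + [clusters[i] | clusters[j]])
    return merges

# Hypothetical distances: items 1 and 2 are closest (20 units);
# item 3 joins the {1, 2} cluster at 58 units.
pairwise = {(1, 2): 20, (1, 3): 60, (2, 3): 58}
merges = single_linkage_merges(pairwise, [1, 2, 3])
```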
The dendrogram doesn’t answer the question “How many clusters are there?” Fortunately, there are ways to estimate which of the branches in such a diagram are reliable and which are not.
- For many clustering approaches there are ways of using bootstrap resampling to estimate p-values of the clusters. This provides some idea of how robust the clusters are.
- Cluster diameter, D, is the distance between the two observations within the cluster that are the furthest apart.
- Cluster separation, S, is the smallest distance between any observation within the cluster and any observation outside it.
- Cluster isolation, L, is based on S, D, and the distances between every possible pairing of the members of two clusters.
- There are coefficients that measure, for each observation, the ratio of the linkage at which it joins its cluster to the linkage between the most distant clusters. There are agglomerative and divisive versions.
- Silhouette Width is a per-observation measure of how similar it is to the members of its own cluster versus how dissimilar it is from the members of the next nearest cluster.
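The silhouette calculation for a single observation can be sketched as follows; the function name, data, and cluster assignments are invented for illustration:

```python
from math import dist  # Euclidean distance (Python 3.8+)

def silhouette_width(point, own_cluster, other_clusters, distance):
    """Silhouette of one observation: (b - a) / max(a, b), where
    a is the mean distance to the other members of its own cluster and
    b is the mean distance to the members of the nearest other cluster.
    The result lies in [-1, 1]; values near +1 mean the observation
    sits comfortably inside its cluster."""
    others = [p for p in own_cluster if p != point]
    a = sum(distance(point, q) for q in others) / len(others)
    b = min(sum(distance(point, q) for q in c) / len(c)
            for c in other_clusters)
    return (b - a) / max(a, b)

# Invented data: a tight cluster near the origin and one far away.
own = [(0, 0), (0, 1), (1, 0)]
far = [(10, 10), (10, 11)]
s = silhouette_width((0, 0), own, [far], dist)  # close to +1: well placed
```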
Things to Remember
Pictures of Clusters
If there is a nice picture of some clusters, such as that at the right, it is most likely not a realistic representation unless the clustering was truly done in two dimensions. Sometimes the axes mean something; sometimes the space in which the clusters are drawn is purely conceptual. Sometimes a diagram will have been produced after a dimensionality-reduction process (akin to principal component analysis) has been applied, so only the two most significant dimensions are presented. This can give rise to what look like overlapping clusters when in fact you are missing the other dimensions on which the separation is taking place. The easiest analogy is to imagine looking down on a scene from above: perhaps it looks like the bus is about to crash into the airplane, but they are actually separated by vertical space.
Missing Observations Can Change Clustering
The presence or absence of a particular observation can dramatically change the resulting clustering, so it is not a good idea to work with partial data sets. This is particularly the case with hierarchical methods, where each step changes the cluster centers (e.g. the medians), so a single missing observation can alter every subsequent merge. Always include as many observations as is practical.