In order to “cluster” observations we need some measure of how alike or different they are. This leads naturally to the concept of a distance between observations. In my last post about clustering I stated the following:
An observation is a vector of values, not necessarily of the same type, associated with the object which is to be clustered. They might be of the following types:
- Numerical: e.g. 4’8″, 6’4″, 5’10” – if we placed values on a scale we could visualize distance.
- Ordinal: e.g. 1st, 2nd, 3rd – the ordering matters (e.g. 1st is closer to 2nd than to 3rd, but we don’t know anything about how much closer).
- Binary: True / False – the feature is either there or it is not.
- Categorical: e.g. red, blue, green – there is no ordering to the categories.
Note that datasets may include different data types, complicating the distance calculation.
Let’s dig into the concept of “distance” a little more ….
Distance Matrix
Imagine we have a set of observations and we want a compact way to represent the distances between each pair. The obvious choice is to create a “distance matrix”. In all the following discussions that is what we are working towards. Only when we have the distance matrix can we begin the process of separating the observations into clusters.
In the R packages that implement clustering (stats, cluster, pvclust, etc.), be careful to check how the raw data is expected to be organized. In stats and cluster, for example, each observation is a separate row and each variable a separate column; in pvclust the data is transposed. A simple test is to make up a matrix or data frame that is not square and check that you get the expected results.
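As a quick sanity check, here is a minimal sketch of that non-square test using base R's dist() (the toy data is made up for illustration):

```r
# 5 observations (rows) of 3 variables (columns) -- deliberately non-square
set.seed(1)
x <- matrix(rnorm(15), nrow = 5, ncol = 3)

# stats::dist() treats each row as an observation, so we expect a 5 x 5 matrix
dim(as.matrix(dist(x)))      # 5 5

# A package that expects variables in rows (pvclust, for example) would need
# the transpose -- this produces a 3 x 3 matrix instead
dim(as.matrix(dist(t(x))))   # 3 3
```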
If we have, say, 5 observations of a single variable: {70, 99, 81, 29, 10} we can derive a distance matrix as follows:
|   | 1  | 2  | 3  | 4  |
|---|----|----|----|----|
| 2 | 29 |    |    |    |
| 3 | 11 | 18 |    |    |
| 4 | 41 | 70 | 52 |    |
| 5 | 60 | 89 | 71 | 19 |
The interpretation is pretty straightforward: each row–column pair gives the distance between two observations. For example, observations 1 and 2 are 70 and 99 respectively, so the distance between them is |70 − 99| = 29. An easy way to picture this is a straight line with a scale of 0–100 and a dot for each observation. For n observations there are n(n − 1)/2 distances.
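For reference, a one-line sketch in R that reproduces this matrix (dist() on a one-column matrix reduces to absolute differences):

```r
x <- c(70, 99, 81, 29, 10)
dist(matrix(x, ncol = 1))   # lower-triangle matrix of pairwise absolute differences
#    1  2  3  4
# 2 29
# 3 11 18
# 4 41 70 52
# 5 60 89 71 19
```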
But what are the units of distance? What if there had been more than one variable per observation? What if one variable covered a much wider range than the other? What if the variables had quantities that differed by orders of magnitude? What if the observations had not been numerical but had been ordinal, binary or categorical?
Hopefully, some or all of these questions will be answered below.
Numerical Data
Numerical data is typically divided into two sub-groups:
- Interval Scale
- Ratio Scale
The difference is easy to detect: ask if there is a meaningful “zero” on the scale. For example, temperature in degF or degC is interval scale – the “zero” value is arbitrary – while temperature in degK is ratio scale. Another useful test is to ask whether the ratio between two values is meaningful: is 20degC twice as hot as 10degC? Is 100kg twice the mass of 50kg? The answer to the first question is ‘no’; to the second, ‘yes’.
Fortunately, whether using interval or ratio scales, the absolute distance between two values doesn’t depend on whether the ‘zero’ value is meaningful or not. So, just as TdegC and (T + 5)degC differ by 5degC for all values of T, Wkg and (W + 15)kg differ by 15kg for all values of W.
Multiple Variables
Typically the data is more complicated than a set of observations of a single variable: there may be many variables. Let’s start with a simple case of 4 observations of coordinate locations in 3 dimensions:
- (2, 3, 4)
- (3, 4, 2)
- (5, 7, 2)
- (3, 6, 5)
We could do the obvious and calculate the distance between each pair using a Euclidean distance to get the following distance matrix:
|   | 1    | 2    | 3    |
|---|------|------|------|
| 2 | 2.45 |      |      |
| 3 | 5.39 | 3.61 |      |
| 4 | 3.32 | 3.61 | 3.74 |
In this case we calculated the distance, d, between any pair of observations i and j as

$$d_{ij} = \sqrt{\sum_{v} (x_{iv} - x_{jv})^2}$$

where $x_{iv}$ is the value of variable $v$ for observation $i$.
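A quick sketch to confirm this in R (dist() defaults to the Euclidean method):

```r
pts <- rbind(c(2, 3, 4),
             c(3, 4, 2),
             c(5, 7, 2),
             c(3, 6, 5))
round(dist(pts), 2)   # matches the distance matrix above
#      1    2    3
# 2 2.45
# 3 5.39 3.61
# 4 3.32 3.61 3.74
```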
In fact we can generalize this approach to quantifying the distance between two observations as follows:

$$d_{ij} = \left( \sum_{v} |x_{iv} - x_{jv}|^{p} \right)^{1/p}$$

This is generally referred to as the Minkowski distance. When p = 1 it is the same as the Manhattan or city-block distance, p = 2 gives the Euclidean distance, and p = infinity gives the supremum distance (i.e. the largest difference across all the variables, v). The absolute value is needed because raising negative differences to an odd power of p would result in taking roots of negative numbers. So which should you use? In practice you will almost always use the Euclidean distance: it matches our everyday, straight-line notion of how far apart two points are, and it is unlikely you will ever see anything used other than Manhattan or Euclidean. I would only use Manhattan if there were a compelling reason – for example, the real-world constraints of the problem require it.
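In R, stats::dist() covers all of these cases; a minimal sketch using the same pts matrix as above:

```r
d_man <- dist(pts, method = "manhattan")          # Minkowski with p = 1
d_euc <- dist(pts, method = "euclidean")          # p = 2 (the default)
d_sup <- dist(pts, method = "maximum")            # p = infinity (supremum)
d_p3  <- dist(pts, method = "minkowski", p = 3)   # any other value of p
```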
Normalizing
The example above had a natural real-world interpretation because the observations were coordinates in three-dimensional space: the distance matrix represents a physical thing we can all understand. How do we compute distances in a meaningful way if variable 1 were height, variable 2 weight, and variable 3 the weekly grocery bill? Let’s say the sample data looked like this:
- (2, 3, 400)
- (3, 4, 200)
- (5, 7, 200)
- (3, 6, 500)
The resulting distance matrix (using p = 2, Euclidean distance) is:
|   | 1   | 2    | 3   |
|---|-----|------|-----|
| 2 | 200 |      |     |
| 3 | 200 | 3.61 |     |
| 4 | 100 | 300  | 300 |
The large numerical values of variable 3 have swamped the values of variables 1 and 2. When we come to cluster the data using this distance matrix, the result will largely be a function of variable 3. We can solve this (if indeed it is a problem) by normalizing the values of the variables across all the observations. There are several ways to do this:
- subtract the column mean from each observation and divide by the standard deviation
- subtract the column mean from each observation and divide by the mean absolute deviation
- subtract the column median from each observation and divide by the median absolute deviation
Going down the list, each method is more robust than the previous and less prone to the influence of outliers. Employing the first approach yields the following distance matrix, which is much closer in character to the matrix we obtained before the variable 3 values were scaled up by a factor of 100:
|   | 1    | 2    | 3    |
|---|------|------|------|
| 2 | 1.65 |      |      |
| 3 | 3.50 | 2.29 |      |
| 4 | 1.94 | 2.28 | 2.61 |
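A sketch of the three normalization options in R, applied to the hypothetical height/weight/grocery data above (scale() implements the first approach; the other two are hand-rolled here):

```r
raw <- rbind(c(2, 3, 400),
             c(3, 4, 200),
             c(5, 7, 200),
             c(3, 6, 500))

# 1. Subtract the column mean, divide by the standard deviation
z1 <- scale(raw)
round(dist(z1), 2)   # reproduces the normalized distance matrix above

# 2. Subtract the column mean, divide by the mean absolute deviation
mean_ad <- function(v) mean(abs(v - mean(v)))
z2 <- sweep(sweep(raw, 2, colMeans(raw)), 2, apply(raw, 2, mean_ad), "/")

# 3. Subtract the column median, divide by the median absolute deviation
# (constant = 1 gives the raw MAD; R's mad() rescales by 1.4826 by default)
raw_mad <- function(v) mad(v, constant = 1)
z3 <- sweep(sweep(raw, 2, apply(raw, 2, median)), 2, apply(raw, 2, raw_mad), "/")
```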
If any of the variables covers orders of magnitude from smallest to largest, it might help to take logs of those variables before normalizing as above. This would have the effect of making the distance between values of 2 and 4 the same as the distance between values of 20 and 40.
In all cases it is wise to think about the data and what it means to apply any of these data conditioning steps – your judgment will tell you if they make sense or not. Context rules!
Ordinal Data
Ordinal data poses its own problems in that the ordering of some data doesn’t tell us much about how far rank 1 is from rank 2 and so on. Purists will say you should never treat ordinal data as if it were interval (numeric) data, but if you are comfortable that the ranks are roughly evenly spaced, go ahead and use it that way. For example, grades A through F could be treated as 1 through 6, provided the class is large enough that each grade is awarded to several students. Individual class rankings might be more problematic: chances are the difference between 1st and 2nd is greater than the difference between 30th and 31st. Or maybe not – context rules, as I said above.
There are a couple of approaches to dealing with ordinal data (which is common in the social sciences – you’ve done those surveys where you are asked whether you are satisfied or VERY satisfied). One approach is to use normalized ranks, the Spearman distance, or the Footrule distance, which amount to treating ordinal data as interval data. Then there are the Kendall, Cayley, and Ulam distances, which are all variations on counting the number of steps needed to edit one set of rankings into the other (edit distances).
Here is a link that contains a good review of the options: Kardi Teknomo. I try to summarize the main points below:
Treating Ordinal as Interval
- Normalized Ranks: scale the ranking scores onto the interval [0, 1]. Say the ordinal variable can range from -3 to +3 (seven levels): -3 becomes 0, -2 becomes 1/6, …, +3 becomes 1. Then proceed to measure the distance using the Euclidean distance discussed above.
- Spearman Distance: in this case, the data is typically mutually exclusive. For example, a set of options is ranked so that for a given observation, one variable has the value “1”, another has the value “2”, and so on. The value “1” appears for only one variable per observation. The distance between 2 observations is the sum of the squares of the differences in each variable. Note this is NOT the Euclidean distance.
- (Spearman) Footrule distance: same as above but the distance is the sum of the absolute differences in each variable (a Manhattan distance between the observations).
It is not clear to me whether the Spearman measures above must be for unique rankings only. Furthermore, it seems to me that the three methods above really only differ in the choice of Euclidean, sum of squares or Manhattan.
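Here is a small sketch of the three ordinal-as-interval measures in R, using two made-up rankings of five items:

```r
# Two hypothetical rankings of the same five items (1 = highest rank)
obs1 <- c(1, 2, 3, 4, 5)
obs2 <- c(3, 2, 1, 5, 4)

# Normalized ranks: map ranks 1..M onto [0, 1], then take the Euclidean distance
M  <- 5
z1 <- (obs1 - 1) / (M - 1)
z2 <- (obs2 - 1) / (M - 1)
sqrt(sum((z1 - z2)^2))    # ~0.79

# Spearman distance: sum of squared rank differences (no square root)
sum((obs1 - obs2)^2)      # 10

# (Spearman) Footrule distance: sum of absolute rank differences
sum(abs(obs1 - obs2))     # 6
```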
Edit Distances
- Kendall Distance: Count the MINIMUM number of swaps of adjacent pairs of variables that would have to be made to transform one observation into another. This can only work on a unique ranking system, otherwise it might be impossible to transform one observation into another. For example, Obs1 = {1,2,3,4,5}, Obs2 = {3,2,1,5,4} have a Kendall distance of 4. It takes 4 swaps to get from Obs2 to Obs1: {3,2,1,4,5}, {3,1,2,4,5}, {1,3,2,4,5}, {1,2,3,4,5}.
- Cayley Distance: Count the MINIMUM number of swaps of any pair of variables that would have to be made to transform one observation into another. This can only work on a unique ranking system. Note the Kendall distance is always greater than or equal to the Cayley distance. For example, Obs1 = {1,2,3,4,5}, Obs2 = {3,2,1,5,4} have a Cayley distance of 2. It takes 2 swaps to get from Obs2 to Obs1: {3,2,1,4,5}, {1,2,3,4,5}.
- Ulam Distance: For any two observations, what is the MINIMUM number of delete-shift-insert operations necessary to match the two. For example, Obs1 = {1,2,3,4,5}, Obs2 = {3,2,1,5,4} takes 3 steps to make Obs2 identical to Obs1: {3,2,1,4,5}, {1,3,2,4,5}, {1,2,3,4,5}.
The tricky thing with these approaches is figuring out the minimum number of edits necessary!
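A sketch of the three edit distances in R, assuming unique rankings. Kendall simply counts discordant pairs; Cayley uses the fact that the minimum number of arbitrary swaps is n minus the number of cycles in the relative permutation; Ulam uses n minus the length of the longest increasing subsequence:

```r
# Kendall distance: number of pairs ranked in opposite order by the two observations
kendall_dist <- function(a, b) {
  n <- length(a); d <- 0
  for (i in 1:(n - 1)) {
    for (j in (i + 1):n) {
      if (sign(a[i] - a[j]) != sign(b[i] - b[j])) d <- d + 1
    }
  }
  d
}

# Cayley distance: n minus the number of cycles in the relative permutation
cayley_dist <- function(a, b) {
  p <- b[order(a)]                     # express b relative to a's ordering
  n <- length(p); seen <- rep(FALSE, n); cycles <- 0
  for (i in 1:n) {
    if (!seen[i]) {
      cycles <- cycles + 1
      j <- i
      while (!seen[j]) { seen[j] <- TRUE; j <- p[j] }
    }
  }
  n - cycles
}

# Ulam distance: n minus the longest increasing subsequence (O(n^2) DP)
ulam_dist <- function(a, b) {
  p <- b[order(a)]
  n <- length(p); lis <- rep(1, n)
  for (i in seq_len(n)) {
    for (j in seq_len(i - 1)) {
      if (p[j] < p[i]) lis[i] <- max(lis[i], lis[j] + 1)
    }
  }
  n - max(lis)
}

obs1 <- c(1, 2, 3, 4, 5)
obs2 <- c(3, 2, 1, 5, 4)
kendall_dist(obs1, obs2)  # 4
cayley_dist(obs1, obs2)   # 2
ulam_dist(obs1, obs2)     # 3
```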