In an earlier post I looked at distance measures, or the dissimilarity of observations, for clustering. In an even earlier post I had referred to analyzing hedge fund regulatory data, using clustering to try to put the funds into groups by inferred strategy. Along the way I had to solve a problem with clustering that has been bothering me for a while: how do you measure the dissimilarity of observations when the data is sparse? In my case the problem is further compounded by order-of-magnitude differences in the values for one observation vs. another (a Pareto distribution).

### Background

I have downloaded from the SEC's IAPD website and from the NFA's BASIC website a lot of information about the funds operated by 70 of the largest hedge fund managers according to the 2015 version of Institutional Investor's Alpha Hedge Fund 100. My hypothesis is simple: managers adopting similar market strategies (as distinct from trading strategies) will tend to offer similar funds in the marketplace and use similar names for them.

I dismantle all the names of the funds to create a dictionary of “fund words”. This is harder than it sounds – there is a ton of clean-up to do, including filtering out meaningless words like brand names, numbers, forms of organization, etc., not to mention the outrageous number of spelling mistakes! For each manager I count up the number of times a word appears and also total up the $AUM associated with each word, based on the AUM of the funds that use the word.

For example, Anchorage reports 6 funds with $2.8bn AUM with the word “CLO” in their names, 13 funds with $1bn AUM with the word “Credit”, and so on. After all the filtering, Anchorage ends up with only about 13 meaningful words in its vocabulary.

My overall dictionary across all managers in my base case includes 512 words (my cases range from 50 to 1800 words). So you can see that two managers might well share only a few words in common. The opposite problem is that two managers may have very similar vocabularies, but one may have an order of magnitude more $AUM associated with the same words. Traditional distance measures like Manhattan or Euclidean will be dominated by the lack of overlap, or by the sheer difference in overall AUM between the two managers. This is the problem I have sought to solve.

I have come up with an approach that appeals to me, and I want to share it. First, I want to look at how we measure distance between observations when the data are categorical. Then I want to show how I think the categorical approach can be combined with numerical data that is sparse.

### Categorical Data

Categorical data means things like colors or symptoms: the data just indicates which bucket something is in; there is no size or ordinal information. The simplest way to picture this data is as a list of all the categories, with each observation a logical vector indicating true for the presence of an attribute and false for its absence. The preamble above seems all about numeric data, but bear with me – think “True” if the manager uses a certain word and “False” if it does not.

There are two methods for dealing with categorical data, and the choice depends on whether it is significant that, for a given pair of observations, neither observation contains a particular feature:

- If all the data is important, so that the absence of category membership in both members of a pair of observations matters, use a “Symmetric Binary” measure of distance.
- If it only matters that at least one member of a pair of observations has a feature, and it is not important when neither observation has it, use an “Asymmetric Binary” measure of distance.

### Symmetric Binary Distance

$latex \textbf{X}=\left \{ x_{i}\right \}, \textbf{Y}=\left \{ y_{i}\right \}\Rightarrow d_{X, Y} = \frac{count \left \{ XOR(x_{i}, y_{i}) = TRUE \right \}}{\mid \textbf{X} \mid}&s=2$

Consider two observations: (T, T, F, F, F, F, F, F, F, F) and (T, F, F, F, F, F, F, F, F, F), their distance will be 0.1 (on a scale of 0 – 1), because there is only one difference out of 10 elements.

### Asymmetric Binary Distance

$latex \textbf{X}=\left \{ x_{i}\right \}, \textbf{Y}=\left \{ y_{i}\right \}\Rightarrow d_{X, Y} = \frac{count \left \{ XOR(x_{i}, y_{i}) = TRUE \right \}}{count \left \{ OR(x_{i}, y_{i}) = TRUE \right \}}&s=2$

Consider the same two observations above, their distance will be 0.5 (on a scale of 0 – 1), because there is one difference out of the 2 elements that at least one of them exhibits.
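To make the two formulas concrete, here is a quick sketch in Python (illustrative only; the analysis itself is in R) that reproduces the two worked examples above:

```python
def symmetric_binary(x, y):
    """Fraction of all positions where the two logical vectors disagree."""
    diffs = sum(a != b for a, b in zip(x, y))
    return diffs / len(x)

def asymmetric_binary(x, y):
    """Disagreements divided by positions where at least one vector is TRUE."""
    diffs = sum(a != b for a, b in zip(x, y))
    either = sum(a or b for a, b in zip(x, y))
    return diffs / either

x = [True, True] + [False] * 8
y = [True] + [False] * 9
print(symmetric_binary(x, y))   # 0.1
print(asymmetric_binary(x, y))  # 0.5
```

The single disagreement is divided by 10 elements in the symmetric case, but by only the 2 elements that at least one observation exhibits in the asymmetric case.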

To me, it makes sense to use asymmetric binary when the data is sparse and many of the elements of the vector are highly likely to be false, as when looking at data such as a vocabulary.

### Sparse Numeric Data

One way to evaluate the counts or $AUM associated with a word in a given hedge fund manager’s fund-naming vocabulary is to simply use asymmetric binary, but this seems like throwing away information. I came up with two variations on the asymmetric binary approach that attempt to take advantage of the benefits without wasting the information contained in the numeric data. I call the approach weighted binary, and the variations type I and type II. I don’t imagine I am the first to come up with this idea, but I haven’t found it documented anywhere (but I haven’t tried that hard either).

### Sample Data

I have put together a very simple data set to work with:

Observation | Element 1 | Element 2 | Element 3 | Element 4 | Element 5 | Element 6 |
---|---|---|---|---|---|---|
1 | 100 | 5 | 5 | 5 | 5 | 5 |
2 | 20 | 1 | 1 | 1 | 1 | 1 |
3 | 0 | 100 | 5 | 0 | 0 | 5 |
4 | 0 | 20 | 0 | 0 | 1 | 0 |
5 | 100 | 0 | 0 | 0 | 0 | 5 |

I hope it is clear, in the context of my objective of identifying common strategies amongst hedge fund managers, that observations 1 and 2 are similar and observations 3 and 4 are similar. Observation 5 is a little different from the others but closest to 1 and 2. In this example I am trying to get across both the sparseness (imagine that these vectors – the horizontal rows – are sprinkled with another 500 columns of all zeros) and the scale problems.

### Manhattan Distance

Here’s the clustering we get when we use Manhattan distance (the sum of the absolute differences between the elements of the observations). The distances can be read off the vertical axis. Notice that 1 is paired with 5, and they are the closest two observations. This has happened because of the dominance of the first element of each of these observations. 2 is paired with 4, and the “outlier” is 3!
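You can check these pairings directly. A small Python sketch (again for illustration) of the Manhattan distances on the sample data:

```python
# Sample data from the table above
data = {
    1: [100, 5, 5, 5, 5, 5],
    2: [20, 1, 1, 1, 1, 1],
    3: [0, 100, 5, 0, 0, 5],
    4: [0, 20, 0, 0, 1, 0],
    5: [100, 0, 0, 0, 0, 5],
}

def manhattan(x, y):
    """Sum of absolute element-wise differences."""
    return sum(abs(a - b) for a, b in zip(x, y))

print(manhattan(data[1], data[5]))  # 20 - the closest pair, driven by element 1
print(manhattan(data[2], data[4]))  # 42
print(manhattan(data[1], data[2]))  # 100 - the pairing we actually wanted
```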

### Asymmetric Binary Distance

When we try asymmetric binary distance we get the pairing of 1 and 2, because all the elements of both observations are non-zero, but we miss the pairing of 3 and 4. Furthermore, observation 5 should be closer to 1 and 2, but it could hardly be further away! Because the binary approach ignores the magnitude of the values entirely, some of the structure in the information is lost.
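Again, the numbers bear this out; an illustrative Python sketch of asymmetric binary distance on the sample data:

```python
# Sample data from the table above
data = {
    1: [100, 5, 5, 5, 5, 5],
    2: [20, 1, 1, 1, 1, 1],
    3: [0, 100, 5, 0, 0, 5],
    4: [0, 20, 0, 0, 1, 0],
    5: [100, 0, 0, 0, 0, 5],
}

def asym_binary(x, y):
    """XOR count over OR count, treating any non-zero value as TRUE."""
    xor = sum((a > 0) != (b > 0) for a, b in zip(x, y))
    either = sum((a > 0) or (b > 0) for a, b in zip(x, y))
    return xor / either

print(asym_binary(data[1], data[2]))  # 0.0  - 1 and 2 pair perfectly
print(asym_binary(data[3], data[4]))  # 0.75
print(asym_binary(data[3], data[1]))  # 0.5  - 3 looks closer to 1 than to 4
```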

### Weighted Binary

The formula for calculating this distance is as follows for Type 1 (Actual weights):

$latex \textbf{X}=\left \{ x_{i}\right \}, \textbf{Y}=\left \{ y_{i}\right \}\Rightarrow d_{X, Y} = \frac{\sum x_{i}+y_{i} \mid XOR(x_{i} > 0, y_{i} > 0)}{\sum x_{i} + \sum y_{i}}&s=2$

Essentially we add up the x and y elements where only one of them has a value, and divide by the total of all the x and y values. Obviously, this only works for non-negative data.
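In a quick Python sketch (illustrative only), Type 1 recovers the structure we wanted in the sample data:

```python
# Sample data from the table above
data = {
    1: [100, 5, 5, 5, 5, 5],
    2: [20, 1, 1, 1, 1, 1],
    3: [0, 100, 5, 0, 0, 5],
    4: [0, 20, 0, 0, 1, 0],
    5: [100, 0, 0, 0, 0, 5],
}

def weighted_binary_1(x, y):
    """Sum of values in mismatched positions over the grand total."""
    xor_sum = sum(a + b for a, b in zip(x, y) if (a > 0) != (b > 0))
    return xor_sum / (sum(x) + sum(y))

print(weighted_binary_1(data[1], data[2]))            # 0.0
print(round(weighted_binary_1(data[3], data[4]), 3))  # 0.084
print(round(weighted_binary_1(data[1], data[5]), 3))  # 0.087
```

Now 1 pairs with 2, 3 pairs with 4, and 5 sits close to 1 and 2, as desired.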

The formula for calculating this distance is as follows for Type 2 (Average weights):

$latex \textbf{X}=\left \{ x_{i}\right \}, \textbf{Y}=\left \{ y_{i}\right \}\Rightarrow d_{X, Y} = \frac{\sum \overline{x_{i}} \mid XOR(x_{i} > 0, y_{i} > 0)}{\sum \overline {x_{i}} \mid OR(x_{i} > 0, y_{i} > 0)}&s=2$

Here we are replacing the actual value of the element with the average value of the element across all observations. So we are weighting according to the global importance of the element rather than the importance within this specific pair of observations.
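A Python sketch of Type 2 (illustrative only), with the global weights computed as column averages over the sample data:

```python
# Sample data from the table above
data = {
    1: [100, 5, 5, 5, 5, 5],
    2: [20, 1, 1, 1, 1, 1],
    3: [0, 100, 5, 0, 0, 5],
    4: [0, 20, 0, 0, 1, 0],
    5: [100, 0, 0, 0, 0, 5],
}

# Global weight for each element: its average value across all observations
cols = list(zip(*data.values()))
weights = [sum(c) / len(c) for c in cols]

def weighted_binary_2(x, y):
    """Type 2: weight XOR and OR positions by the global column averages."""
    xor_sum = sum(w for w, a, b in zip(weights, x, y) if (a > 0) != (b > 0))
    or_sum = sum(w for w, a, b in zip(weights, x, y) if a > 0 or b > 0)
    return xor_sum / or_sum

print(weighted_binary_2(data[1], data[2]))            # 0.0
print(round(weighted_binary_2(data[3], data[4]), 4))  # 0.2125
```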

### R Function Implementing Weighted Binary Approach

The function returns an object of class “dist” which can be submitted directly to R’s hclust() function, or anything else that requires such an object.

```r
# A distance measure blending both numeric and binary approaches.
# Ian Rayner 2016-04-28

weightedBinaryDist <- function(data, weights=NA){
  diss <- vector("numeric", length=nrow(data) * (nrow(data) - 1) / 2)

  # if weights are specified, use them in place of the data values (Type 2)
  if (!is.na(weights[1])) {
    for (i in 1:nrow(data)) data[i, data[i, ] > 0] <- weights[data[i, ] > 0]
  }

  counter <- 0
  for (i in 1:(nrow(data) - 1)){
    for (j in (i + 1):nrow(data)){
      counter <- counter + 1
      # keep only the columns where at least one of the pair is non-zero
      newData <- data[c(i, j), data[i, ] > 0 | data[j, ] > 0, drop=FALSE]
      # distance = 1 - (total over shared columns) / (total over all kept columns)
      diss[counter] <- 1 -
        sum(newData[ , newData[1, ] > 0 & newData[2, ] > 0, drop=FALSE]) /
        sum(newData)
    }
  }

  attr(diss, "Size") <- nrow(data)
  attr(diss, "Labels") <- rownames(data)
  attr(diss, "Diag") <- FALSE
  attr(diss, "Upper") <- FALSE
  attr(diss, "method") <- "wtdBinary"
  class(diss) <- "dist"
  return(diss)
}
```
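As a sanity check, here is a rough Python port of the same logic (illustrative, not part of the original R workflow), run against the sample data:

```python
# Sample data from the table above, as a list of rows
data = [
    [100, 5, 5, 5, 5, 5],
    [20, 1, 1, 1, 1, 1],
    [0, 100, 5, 0, 0, 5],
    [0, 20, 0, 0, 1, 0],
    [100, 0, 0, 0, 0, 5],
]

def weighted_binary_dist(data, weights=None):
    """Pairwise Type 1 distances; pass global column weights for Type 2."""
    if weights is not None:
        # substitute the global weight wherever a value is non-zero
        data = [[w if v > 0 else 0 for v, w in zip(row, weights)]
                for row in data]
    diss = []
    for i in range(len(data) - 1):
        for j in range(i + 1, len(data)):
            # keep only the columns where at least one of the pair is non-zero
            pairs = [(a, b) for a, b in zip(data[i], data[j]) if a > 0 or b > 0]
            shared = sum(a + b for a, b in pairs if a > 0 and b > 0)
            diss.append(1 - shared / sum(a + b for a, b in pairs))
    return diss

d = weighted_binary_dist(data)
print(round(d[0], 3))  # distance(1, 2) = 0.0
print(round(d[7], 3))  # distance(3, 4) = 0.084
```

Subtracting the shared-column total from 1 is equivalent to summing the mismatched columns directly, since the kept columns are exactly the union of the two vocabularies.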

At some point in the future I will share some of the results from my work with the actual data for the 70 largest hedge funds.

EDIT: Added the R code section.
