If you do anything that depends upon correlation, you owe it to yourself to be aware of outliers and how to deal with them. For example, a mean-variance optimized portfolio using Modern Portfolio Theory or the Black-Litterman model is critically dependent upon, and very sensitive to, the portfolio covariance matrix. In turn, the covariance matrix is extremely sensitive to outliers (leverage points).
This post was inspired by a posting on the Tr8dr Blog. The original source paper in which the details of the algorithm are specified is “A Fast Algorithm for the Minimum Covariance Determinant Estimator”, Rousseeuw and Van Driessen, 1998. That paper builds on the ideas detailed in “Least Median of Squares Regression”, Rousseeuw, 1984. Both papers are easy to find.
In this post I hope to add a little extra color, detail and exploration to the original sources above.
Motivation: An Example
Let’s start with an example that motivates the topic. On the right is a scatter plot of 5 years of monthly returns for one of QIM’s funds and one of Quantica’s. The data came from the IASG Website.
In red is a tolerance ellipse for the entire dataset, using a chi-squared quantile of 97.5% for 2 degrees of freedom. That means we expect 97.5% of the distribution of points to lie within the ellipse; the ellipse dimensions come straight out of the covariance matrix (its eigenvalues and eigenvectors), which in turn arises from a least squares fit of the data. For more, refer to my post on covariance ellipses.
The green broken line ellipse is the zero correlation ellipse. It is there for comparison so you can see how close to zero the sample correlation (in red) is.
The blue ellipse is what we get when we remove just 4 outliers out of the 60 data points and recalculate the covariance matrix and 97.5% tolerance ellipse. The estimated correlation coefficient changed from approximately zero to 0.36! This is why outliers are known as leverage points. In fact, I like to think of an outlier as any point that exerts undue influence on my analysis. Imagine the effect this would have on the mean-variance optimized portfolio!
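The leverage effect is easy to reproduce. The sketch below uses synthetic data (not the fund returns above): a handful of diagonal leverage points drags the sample correlation of an otherwise uncorrelated cloud well away from zero.

```python
import numpy as np

# 56 uncorrelated "data" points plus 4 diagonal leverage points ("contamination")
rng = np.random.default_rng(0)
clean = rng.standard_normal((56, 2))
outliers = np.array([[4.0, 4.0], [5.0, 5.0], [-4.0, -4.0], [-5.0, -5.0]])
full = np.vstack([clean, outliers])

r_clean = np.corrcoef(clean, rowvar=False)[0, 1]  # near zero
r_full = np.corrcoef(full, rowvar=False)[0, 1]    # pulled strongly positive
# just 4 points out of 60 move the correlation estimate substantially
```

The 4 extreme points dominate both the covariance numerator and the variances, so the estimate reflects the contamination more than the data.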
My approach was naive: I simply picked out the extreme values and re-ran the analysis. This is relatively safe with 2 variables since it is easy to visualize, but it would not be possible in 3 or more dimensions. Furthermore, even with this approach, how do I know I got all the outliers? There’s a subtle and interesting problem going on here, a meta-problem if you like: if I remove an outlier, I affect the location and shape of the tolerance ellipse for the remaining data points, so my choice of outliers may change my choice of outliers! This is the challenge Rousseeuw sought to address.
Rousseeuw’s Approach: Minimum Covariance Determinant
The conventional Pearson covariance approach above uses a least squares fit. As Rousseeuw points out, it is actually a “least sum of squares” fit, but “sum of” is dropped as if summation were the only thing one could do. Since dividing by n does not change the minimizer, this is also a least mean of squares fit. A good place to start when “robustifying” statistics is to replace means with medians, hence the title of Rousseeuw’s 1984 paper.
Here’s the basic idea: imagine your dataset is made up of “data” and “contamination”. We want a method that can reliably estimate the covariance of the “data” with up to (just less than) 50% contamination. So we find the smallest tolerance ellipse that contains at least 50% (the median) of the data. The ellipse will thus be robust to variation in the (just less than) 50% of the data outside the ellipse. With the conventional Pearson covariance, by contrast, a single sufficiently extreme data point is enough to make the estimate unreliable.
The area of an ellipse is π·a·b, where a and b are the semi-axes. For a tolerance ellipse, a and b are proportional to the square roots of the eigenvalues of the 2-variable covariance matrix, so a·b is proportional to the square root of the product of the eigenvalues, which equals the square root of the determinant of the covariance matrix. It follows that to minimize the area of the ellipse we need only minimize the determinant of the covariance matrix. This idea readily extends to 3 dimensions (a tolerance ellipsoid) and higher, where the volume is a constant times the product of all the semi-axes.
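A quick numerical check of the area/determinant relationship (the matrix entries below are arbitrary; any 2×2 positive-definite matrix will do):

```python
import numpy as np

# any 2x2 positive-definite covariance matrix (values here are arbitrary)
S = np.array([[4.0, 1.2],
              [1.2, 2.0]])

lam = np.linalg.eigvalsh(S)  # eigenvalues, both positive for a valid covariance

# semi-axes of the unit tolerance ellipse are sqrt(eigenvalues)
area_axes = np.pi * np.sqrt(lam[0]) * np.sqrt(lam[1])  # pi * a * b
area_det = np.pi * np.sqrt(np.linalg.det(S))           # pi * sqrt(det)
# the two agree, so minimizing the ellipse area == minimizing the determinant
```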
My immediate thought was “Why 50% of the data?” What if my contamination is much lower than that, aren’t I wasting valuable data? In the 1998 paper, Rousseeuw no longer talks about 50%, but about a portion of the data “h”, whose covariance is to be considered “robust”.
h is constrained to the range (n + p + 1) / 2 ≤ h ≤ n, where n is the size of the dataset and p is the number of dimensions. The user sets h.
Rousseeuw suggests defaulting to h = floor(0.75 × n). The breakdown value (i.e. the limit on the amount of contamination before the estimated covariance becomes unreliable) is (n − h + 1)/n.
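As a small sketch, the default and the breakdown value can be wrapped in a pair of helpers (the function names are mine, not from the paper):

```python
import math

def default_h(n, p):
    # Rousseeuw's suggested compromise h = floor(0.75 * n),
    # clipped to the valid lower bound (n + p + 1) / 2
    return max((n + p + 1) // 2, math.floor(0.75 * n))

def breakdown_value(n, h):
    # fraction of contamination the estimate can withstand: (n - h + 1) / n
    return (n - h + 1) / n
```

For the 60-point example above, default_h(60, 2) gives h = 45 and a breakdown value of 16/60, i.e. roughly 27% contamination.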
To summarize the general approach: search the dataset of size n, for the sub-sample of size h whose covariance matrix has the lowest possible determinant.
The FastMCD Algorithm
The algorithm begins with an initialization phase:
1. Select a random sub-sample (without replacement) of size (p + 1)
2. Compute Told and Sold, the mean and covariance of the sub-sample
3. If det(Sold) = 0, add another random data point and repeat from step 2
Then there is a “concentration” phase:
1. Compute the Mahalanobis distances (MDs) of the entire dataset based on Told and Sold
2. Create a new sub-sample from the h data points with the smallest MDs
3. Compute Tnew and Snew, the mean and covariance of the sub-sample
4. Compute det(Snew)
5. If the new determinant is equal to the old, move on to the summary phase
6. If the new determinant is smaller than the old, set Told = Tnew and Sold = Snew, and repeat from step 1
The summary phase:
1. If this is the smallest determinant so far, set Tbest = Told and Sbest = Sold
2. Go back to the initialization phase until the cycle has been repeated, say, 500 times
3. If required, display MDs vs. sample index (Distance-Index or DI-plot)
4. If required, display MDs based on Tbest and Sbest (robust distances) vs. MDs based on Tsample and Ssample, the conventional sample mean and covariance (standard distances) (Distance-Distance or DD-plot)
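The phases above can be sketched in NumPy. This is a minimal illustration of the raw estimator only: it omits the paper's nesting, partitioning, and selective-iteration speed-ups, as well as the reweighting step, and the variable names are mine.

```python
import numpy as np

def mahalanobis_sq(X, T, S):
    """Squared Mahalanobis distance of each row of X from mean T, covariance S."""
    d = X - T
    return np.einsum('ij,jk,ik->i', d, np.linalg.inv(S), d)

def fast_mcd(X, h=None, n_starts=100, seed=None):
    """Sketch of FastMCD: mean and covariance of the h-subset with the
    smallest covariance determinant found over n_starts random restarts."""
    rng = np.random.default_rng(seed)
    n, p = X.shape
    if h is None:
        h = int(np.floor(0.75 * n))  # Rousseeuw's suggested default
    best_det, best_T, best_S = np.inf, None, None
    for _ in range(n_starts):
        # initialization phase: random (p + 1)-subset, grown until non-singular
        idx = rng.choice(n, size=p + 1, replace=False)
        while True:
            T = X[idx].mean(axis=0)
            S = np.cov(X[idx], rowvar=False)
            if np.linalg.det(S) > 0 or len(idx) == n:
                break
            extra = rng.choice(np.setdiff1d(np.arange(n), idx), size=1)
            idx = np.concatenate([idx, extra])
        # concentration phase (C-steps): the determinant can never increase
        old_det = np.inf
        while True:
            keep = np.argsort(mahalanobis_sq(X, T, S))[:h]
            T = X[keep].mean(axis=0)
            S = np.cov(X[keep], rowvar=False)
            new_det = np.linalg.det(S)
            if new_det == 0 or new_det >= old_det:  # converged (or exact fit)
                break
            old_det = new_det
        # summary phase: keep the best solution across restarts
        if new_det < best_det:
            best_det, best_T, best_S = new_det, T, S
    return best_T, best_S

# quick check on contaminated synthetic data: 50 clean points, 5 gross outliers
demo = np.vstack([np.random.default_rng(0).standard_normal((50, 2)),
                  np.full((5, 2), 8.0)])
T_best, S_best = fast_mcd(demo, seed=1)
```

The robust determinant on the synthetic data comes in far below the classical one, because the concentration steps never admit the gross outliers into the h-subset.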
A key thing to note, as with many numerical methods, is that there are two user-defined parameters: h and the number of cycles. I found very little consequence from changing the number of cycles and, at this point, have settled on a default of 100. The choice of h is another matter and the subject of my next post.
Rousseeuw’s description of the algorithm includes a good deal of material on handling large datasets by breaking them down into sub-sets. In addition, he takes the approach of calling off the search for the minimum determinant after a couple of concentration steps, saving the sub-samples with the best prospects for a thorough search later. Since my goal was understanding rather than speed, I shall save these steps for the future.
Back to the Example Above
On the right is the scatterplot again, this time with FastMCD applied.
Notice additional outliers have been identified versus the original, and the estimate of correlation is now in the region of 0.55, higher still than the result of manually removing outliers. It is important to understand that the “outliers” in the chart are NOT simply all the points outside the h sub-sample. I used the default h = 0.75 × 60 = 45, implying, perhaps, 15 outliers. That is not how it works. The 13 outliers in the chart are identified using the robust (blue) tolerance ellipse. They are only outliers in the sense that they lie outside the 97.5% tolerance ellipse, and of course we expect 2.5% of the dataset to be there! There were an additional 2 points excluded from the covariance calculation.
Essentially, the algorithm has done a better job than I did by eye of allowing for the difference in variances of the two managers when selecting outliers. All 4 of my points are in red, but so are an additional 9 points.
DI-Plot of Mahalanobis Distances
The next plot is the DI-Plot, showing the robust Mahalanobis distances (those based on the robust analysis) of each datapoint in time order. Notice there are more to the left than the right: there is autocorrelation that needs to be investigated (see the next post). The horizontal line is the chi-squared tolerance at 97.5% with 2 degrees of freedom; it is essentially the blue tolerance ellipse transformed into the time domain. You can count 13 points above that line.
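The distances and the cutoff line are straightforward to compute. The sketch below uses synthetic data and a classical fit as a stand-in; with real FastMCD output you would substitute Tbest and Sbest.

```python
import numpy as np
from scipy.stats import chi2

# synthetic stand-in for the 60 monthly return pairs (not the real data)
rng = np.random.default_rng(1)
X = rng.standard_normal((60, 2))

# with real FastMCD output, substitute Tbest and Sbest here
T = X.mean(axis=0)
S = np.cov(X, rowvar=False)

d = X - T
md2 = np.einsum('ij,jk,ik->i', d, np.linalg.inv(S), d)  # squared Mahalanobis distances

cutoff = chi2.ppf(0.975, df=2)          # the horizontal 97.5% line (about 7.38)
flagged = np.flatnonzero(md2 > cutoff)  # indices plotted above the line
# DI-plot sketch with matplotlib: plt.plot(md2, "o"); plt.axhline(cutoff)
```

Note the squared distances compare directly against the chi-squared quantile; plotting plain distances against the square root of the quantile works equally well.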
DD-Plot of Mahalanobis Distances
The final plot is the DD-Plot. The lines dividing the plot represent the chi-squared tolerance at 97.5% with 2 degrees of freedom. Again, they are essentially transformations of the tolerance ellipses: the horizontal line is the robust, the vertical the conventional. You will notice that there are 2 points beyond the conventional tolerance line, just as there are 2 points outside the red ellipse in the scatterchart. There are 13 points above the robust line, just as there are 13 outside the blue ellipse in the scatterchart.
Notice that the points are “sprayed” above and to the left of the diagonal that indicates robust MD = conventional MD. This shows that the robust MDs are generally higher than the conventional ones, as expected: the robust method does not allow the outliers to influence the choice of mean and covariance, whereas the conventional approach has found a solution that sacrifices robustness to accommodate them. There appear to be two sprays, perhaps related to the changes in the DI-Plot, suggesting further investigation is warranted.
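For readers who would rather not roll their own, scikit-learn ships an MCD implementation (MinCovDet), which makes a DD-plot a few lines of work. The data below are synthetic, not the fund returns.

```python
import numpy as np
from scipy.stats import chi2
from sklearn.covariance import EmpiricalCovariance, MinCovDet

# synthetic stand-in: 56 clean points plus 4 planted leverage points
rng = np.random.default_rng(2)
X = np.vstack([rng.standard_normal((56, 2)),
               [[4.0, 4.0], [5.0, 4.0], [4.0, 5.0], [5.0, 5.0]]])

classical = EmpiricalCovariance().fit(X)
robust = MinCovDet(support_fraction=0.75, random_state=0).fit(X)

d_classical = np.sqrt(classical.mahalanobis(X))  # standard distances (x-axis)
d_robust = np.sqrt(robust.mahalanobis(X))        # robust distances (y-axis)
cut = np.sqrt(chi2.ppf(0.975, df=2))             # both tolerance lines
# DD-plot sketch: plt.scatter(d_classical, d_robust); plt.axhline(cut); plt.axvline(cut)
```

The planted outliers land well above the robust tolerance line, reproducing the spray above the diagonal described in the text.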
There is a lot more to share on the topic of FastMCD, especially on the effect of sample size and the selection of h when looking at time-series data. This will be the subject of my next post. All in all, I think this material demonstrates the value of robust statistical analysis and provides a fairly easy way of implementing the Minimum Covariance Determinant method.
Edit 2012-01-24: added “, repeat1” to The FastMCD Algorithm, Concentration step, item 6.