We are going to play around with a mixture distribution made up of a large proportion of ~N(0, 1) and a small proportion of ~N(0, 1 + a), where the second parameter is the standard deviation. The wider distribution “pollutes” the standard normal distribution. We are going to see that mean absolute deviation is a more efficient estimator of the distribution’s dispersion than standard deviation. We are also going to see some behavior that was unexpected (by me, at least) in response to the level of pollution of the base distribution, the dispersion of the polluting distribution, and the sample size.
Section 1.4 of the draft of Taleb’s “Silent Risk” is a discussion of “Robustness of Payoff or Robustness of Estimation”, highlighting the weaknesses of “robust” statistics. Sections 1.4.3 and 1.4.4 take a side trip into standard deviation (SD) vs. mean absolute deviation (MD) as measures of dispersion. This is clearly one of Taleb’s hobby horses: he submitted section 1.5 as his answer to Edge’s 2014 question, “What Scientific Idea Is Ready for Retirement?”.
Taleb is saying that we should use mean absolute deviation rather than standard deviation when discussing the dispersion of a distribution. He offers several technical reasons, but leans heavily on the point that mean absolute deviation is more in tune with natural intuition.
In the following, Taleb illustrates that if an apparently normal distribution is “polluted” with even a slightly heavy tail, then MD is a more efficient measure of dispersion than SD (meaning, for a given sample size, MD is likely to be closer to its true value than SD). I include R code so the reader can experiment for themselves. I also extend the discussion a little beyond section 1.4.4, by exploring the effect of the level of pollution and the sample size.
Measures of Dispersion
A Historical Side-Note
Taleb gives us a nice historical side note on how standard deviation (SD), rather than mean absolute deviation (MD), came to be the most commonly used measure of dispersion. The famous statistician Ronald A. Fisher argued for SD on the basis that, if a population is normally distributed, SD is a more efficient statistic than MD (by approximately 12.5%!). But what if the population in question is not “perfectly” normal?
Asymptotic Relative Efficiency
First, let’s define asymptotic relative efficiency (ARE), the measure of the relative rate at which estimates of two different statistics approach their actual values as the size of the sample increases:

$$\mathrm{ARE}(\mathrm{MD}, \mathrm{SD}) = \lim_{n \to \infty} \frac{\operatorname{var}(\mathrm{SD}) / \operatorname{mean}(\mathrm{SD})^2}{\operatorname{var}(\mathrm{MD}) / \operatorname{mean}(\mathrm{MD})^2}$$

where n is the sample size, var(.) is variance, and mean(.) is, well, mean or expected value. With this orientation, values above 1 mean MD is the more efficient estimator.
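As a quick sanity check of this definition (and of Fisher’s 12.5% figure above), here is a minimal simulation sketch, using only base R, that estimates the relative efficiency of MD vs. SD on samples from an unpolluted standard normal. The ratio should come out at around 0.875, i.e. MD is about 12.5% less efficient when the distribution really is normal:

# Relative efficiency of MD vs. SD on a pure standard normal.
# Values below 1 mean SD is the more efficient estimator.
set.seed(1)
n <- 1000                                # sample size
est <- replicate(10000, {
  x <- rnorm(n)
  c(sd = sd(x), md = mean(abs(x - mean(x))))
})
(var(est["sd", ]) / mean(est["sd", ])^2) /
  (var(est["md", ]) / mean(est["md", ])^2)  # approx. 0.875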
Small Deviations From Normality: SD is Not Very Efficient
Using the rMixDist function from R Utility for Mixture Distribution, we can create samples from a normal distribution “polluted” with a small proportion of samples from a distribution with a wider dispersion. The default values of rMixDist (my R function for creating mixture distributions) are for 2 normal distributions in the ratio 99:1 (i.e. 1% pollution of the first distribution by the second). The parameter values for the first distribution are c(0, 1), i.e. the bulk of the random sample will come from the standard normal distribution, ~N(0, 1).
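For readers who don’t have that earlier post to hand, here is a minimal sketch of what rMixDist might look like. The signature (a single dist function shared by all components, pars as a list of named argument vectors, probs as the mixing proportions) is inferred from how the function is called in the code below, so treat it as an assumption rather than the original implementation:

# Sketch of rMixDist, inferred from its usage below -- the real
# implementation is in the "R Utility for Mixture Distribution" post.
#   dist:  a random-generation function (rnorm by default)
#   pars:  a list of named argument vectors, one per component
#   probs: the mixing proportions (default 99:1)
rMixDist <- function(n, dist = rnorm,
                     pars = list(c(mean = 0, sd = 1), c(mean = 0, sd = 1)),
                     probs = c(0.99, 0.01)) {
  # Pick a component for each draw, then sample from that component
  component <- sample(seq_along(probs), size = n, replace = TRUE, prob = probs)
  vapply(component,
         function(k) do.call(dist, c(list(n = 1), as.list(pars[[k]]))),
         numeric(1))
}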
To reproduce the example in Taleb’s text, we can vary the jump size (i.e. the amount added to the standard deviation of the second distribution) from 0 to 20 in increments of 0.5. For interest, I also varied the sample size over 10, 50, 100, 500, and 1000. So the idea is that we run a Monte Carlo simulation and estimate the ARE of MD vs. SD, for each sample size, for mixture distributions as follows: 99% ~N(0, 1) with 1% ~N(0, 1.0); 99% ~N(0, 1) with 1% ~N(0, 1.5); 99% ~N(0, 1) with 1% ~N(0, 2.0); …; 99% ~N(0, 1) with 1% ~N(0, 21.0). When we plot the estimated ARE vs. the jump size for each sample size, we get the following:
The results make Taleb’s point: with even a small pollution (1%) of a perfect normal distribution, MD rapidly becomes more efficient than SD, and the larger the dispersion of the polluting distribution, the more efficient MD becomes (at least initially). This much you would expect, but there is some interesting behavior that I didn’t expect:
- The ARE has a peak – after a certain point the ARE starts declining with increasing dispersion in the polluting distribution.
- The higher the pollution level the lower the ARE – see the chart for a pollution rate of 2%.
- The smaller the sample size the lower the ARE.
This is why I like to code up the examples in Taleb’s work: it lets me tweak the parameters and see what happens.
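For example, here is a quick single-point check (it assumes the rMixDist sketch above; the jump size, sample size, and pollution rate are just illustrative values to tweak) that estimates the ARE for one combination of parameters:

# One-off ARE estimate: jump a = 5 (pollutant sd = 6), n = 1000, p = 0.01
set.seed(7)
a <- 5; n <- 1000; p <- 0.01
est <- replicate(10000, {
  samp <- rMixDist(n = n, pars = list(c(sd = 1), c(sd = 1 + a)),
                   probs = c(1 - p, p))
  c(sd = sd(samp), md = mean(abs(samp - mean(samp))))
})
# A ratio above 1 means MD is the more efficient dispersion estimator
(var(est["sd", ]) / mean(est["sd", ])^2) /
  (var(est["md", ]) / mean(est["md", ])^2)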
Note 1: Taleb says he uses a sample size of 1,000. On my chart the ARE peaks out around 12 vs. his peak of around 8. I get a value of 8 for sample size 100 – a typo?
Note 2: The same conclusions do not necessarily hold for mixtures built from distributions other than the normal. You can use the code to investigate.
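For instance, assuming the dist argument from the rMixDist sketch above (the degrees-of-freedom values here are arbitrary), you could draw from a mixture of Student-t components and feed it through the same estimates:

# A mixture of two Student-t components instead of normals
# (assumes the rMixDist sketch above, where one dist function
#  is shared by all components)
samp <- rMixDist(n = 1000, dist = rt,
                 pars = list(c(df = 10), c(df = 2)),
                 probs = c(0.99, 0.01))
c(sd = sd(samp), md = mean(abs(samp - mean(samp))))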
Some R Code
Following is the R code you can play around with. Note that it also produces results for the mean squared error (MSE) in the estimates of MD and SD. I did not discuss those results above; suffice it to say they do not change the conclusion that MD is a better estimator of dispersion for a slightly fat-tailed distribution.
# Exploring the ideas in C1 section 1.4.3
# Requires rMixDist (see "R Utility for Mixture Distribution" or the sketch above)

sizes  <- c(10, 50, 100, 500, 1000)  # sample sizes to test
runLen <- 10000                      # Monte Carlo runs per (jump, size) pair
lowA   <- 0                          # smallest jump size
hiA    <- 20                         # largest jump size
aStep  <- 0.5                        # jump size increment
p      <- 0.01                       # pollution (jump) probability

results <- matrix(nrow=runLen, ncol=3,
                  dimnames=list(NULL, c("Mean", "SD", "MAD")))
a <- seq(lowA, hiA, by=aStep)
ARE <- matrix(nrow=length(a), ncol=length(sizes), dimnames=list(a, sizes))
biasMAD <- ARE
biasSD  <- ARE
MSEMAD  <- ARE
MSESD   <- ARE

oldPar <- par(no.readonly=TRUE)
par(cex=0.8)

# Set up the relative-efficiency chart
plot(min(a), 0, pch=3, xlim=c(min(a), max(a)), xlab="Jump Size (std. dev.s)",
     ylim=c(0, 15), ylab="Relative Efficiency",
     main="RE Mean Absolute Deviation vs. Standard Deviation",
     sub=paste0("Jump Probability = ", p), cex=0.8)
colors <- rainbow(length(sizes), start=0.1)
abline(h=1)
devRE <- dev.cur()

# Set up the mean-squared-error chart in a second device
dev.new()
plot(min(a), 0, pch=3, xlim=c(min(a), max(a)), xlab="Jump Size (std. dev.s)",
     ylim=c(0, 1), ylab="Mean Sq. Error",
     main="MSE Mean Absolute Deviation vs. Standard Deviation",
     sub=paste0("Jump Probability = ", p), cex=0.8)
devMSE <- dev.cur()

for (k in 1:length(sizes)) {
  for (j in 1:length(a)) {
    for (i in 1:runLen) {
      samp <- rMixDist(n=sizes[k],
                       pars=list(c("sd"=1), c("sd"=1 + a[j])),
                       probs=c(1 - p, p))
      results[i, ] <- c(mean(samp), sd(samp), mean(abs(samp - mean(samp))))
    }
    # Relative efficiency: (normalized variance of SD) / (normalized variance of MAD)
    ARE[j, k] <- (var(results[ , 2]) / mean(results[ , 2])^2) /
                 (var(results[ , 3]) / mean(results[ , 3])^2)
    # Theoretical SD of the mixture: sqrt((1 - p) * 1 + p * (1 + a)^2)
    biasSD[j, k] <- mean(results[ , 2]) - sqrt((1 - p) + p * (1 + a[j])^2)
    MSESD[j, k]  <- var(results[ , 2]) + biasSD[j, k]^2
    # Theoretical MAD of the mixture is the probability-weighted sum of the
    # component MADs: sqrt(2 / pi) * ((1 - p) + p * (1 + a))
    biasMAD[j, k] <- mean(results[ , 3]) - sqrt(2 / pi) * ((1 - p) + p * (1 + a[j]))
    MSEMAD[j, k]  <- var(results[ , 3]) + biasMAD[j, k]^2
  }
  dev.set(which=devRE)
  points(a, ARE[ , k], pch=20, col=colors[k])
  dev.set(which=devMSE)
  points(a, MSESD[ , k], pch=18, col=colors[k])   # SD: diamonds
  points(a, MSEMAD[ , k], pch=20, col=colors[k])  # MAD: dots
}

dev.set(which=devRE)
legend(x=0.1, y=14, legend=sizes, pch=20, col=colors, title="Sample Size", bty="n")
dev.set(which=devMSE)
legend(x=0.1, y=0.9, legend=sizes, pch=20, col=colors, title="Sample Size", bty="n")
par(oldPar)
Share
If you found this post informative, please share it!