Section 3.2 in Chapter 3 of “Silent Risk”, a draft of a book by Nassim Nicholas Taleb defines the “inverse problem” as follows:

Definition 3.4 (The inverse problem)

There are many more degrees of freedom (hence probability of making a mistake) when one goes from a model to the real world than when one goes from the real world to the model.

He thens brings the problem into sharper focus:

Principle 3.2 (Visibility of the Generator)

In the real world one sees time series of events, not the generator of events, unless one is himself fabricating the data.

He then goes on to illustrate the problem by posing the following problem: If one is provided with a set of data generated by a particular distribution (Pareto, say) and analyses it assuming a different distribution (Gaussian, say) how severe are the errors likely to be? Below, I provide the R scripts to allow this experiment to be simulated.

I look at:

- How far off is the apparent mean from the actual mean, and how frequently is it off in the “bad” direction i.e. the apparent mean is higher than the actual mean.
- How far off will the apparent variance be, and how often in the wrong direction?
- How far off is an estimated loss likely to be and how often will we underestimate the loss?

Reproducing (as best I can) the examples in the text, we set up the experiment as follows:

- Assume the “true” generating distribution for the data is a Pareto Type 4 Distribution (see the previous post) with a tail off to the left. The location parameter, μ = 0, the scale, σ = 1, the tail index, α = 1.5 and the inequality parameter, γ = 0.75.
- Generate 1,000 sets of data, each with 1,000 data points.
- Skip the problem of figuring out what the distribution might be from each data set. Our unwitting subjects will assume a Gaussian distribution.
- Have 1,000 unwitting subjects figure out the mean, standard deviation and estimated 5% loss of each data set, and see how good or bad their estimates are.

## Expected Value

Since we know the parameters of the generating function, we can figure out the true expected value:

The right tailed version:

The left-tailed version:

Where: The shape parameter α is the tail index, μ is location, σ is scale, γ is an inequality parameter. The default values match the examples Taleb uses in Section 3.2, with the exception that Taleb’s text indicates μ = 1, and σ = 2, but the figures are consistent with μ = 0, σ = 1.

In the present case (left tailed distribution), the expected value is approximately -1.27.

# Calculates the expected value of a Type 4 distribution. ePareto4 <- function(mu=0, sigma=1, gamma=0.75, alpha=1.5, right=F){ if (alpha <= gamma) return(NA) mu + ifelse(right, 1, -1) * sigma * gamma(gamma + 1) * gamma(alpha - gamma) / gamma(alpha) }

## Expected Variance

Since we know the parameters of the generating function, we can figure out the true expected variance: The first moment of the Pareto Distribution with location, μ = 0 (see expected value above):

The second moment of the Pareto Distribution with location, μ = 0:

So the variance (for all values of μ) is:

Where: The shape parameter α is the tail index, σ is scale, γ is an inequality parameter. The default values match the examples Taleb uses in Section 3.2. In the present case (left tailed distribution), the expected variance is infinite as α = 2.γ.

# Calculates the expected variance of a Type 4 distribution. vPareto4 <- function(mu=0, sigma=1, gamma=0.75, alpha=1.5, right=F){ if (alpha <= 2 * gamma) return(NA) sigma^2 * (gamma(alpha) * gamma(2 * gamma + 1) * gamma(alpha - 2 * gamma) - gamma(gamma + 1)^2 * gamma(alpha - gamma)^2) / gamma(alpha)^2 }

So now we need a script to generate a 1000 sets of 1000 variates:

# Generate 1000 series of 1000 random numbers with Pareto Type 4 distribution MonteCarloPareto4 <- function(M=1000, n=1000, ...){ matrix(rPareto4(M * n, ...), nrow=M, ncol=n) }

The actual probability density function looks like this:

Here are four randomly selected sets of 1000 variates:

So what do we find when we look at the 1000 sets of generated data (See very last script to generate similar results)?

- 55%+ of the samples have means higher than the “true” mean (See histogram of means).
- 72%+ of the variates are above the “true” mean (See PDF above).
- 100% of the samples have a standard deviation lower than infinity (See histogram of SDs)!
- 100% of the shortfalls estimated from the samples are below the actual expected shortfall (See script below for details).

## Calculating Shortfalls

Following is a script that calculates the expected shortfall for a given region of the tail for the Pareto Distribution. It uses a small increment to do a numerical integration from the location parameter to the percentile that begins the tail (to go the other way requires picking some arbitrary value for negative infinity), and uses the known expected value to derive the expected value contained in the tail:

shortFallPareto4 <- function(tail=0.01, increment=0.001, mu=0, sigma=1, gamma=0.75, alpha=1.5, right=F){ expectedValue <- ePareto4(mu=mu, sigma=sigma, gamma=gamma, alpha=alpha, right=right) if (right){ xRange <- seq(mu, qPareto4(tail, mu=mu, sigma=sigma, gamma=gamma, alpha=alpha, right=right), increment) eLoss <- (expectedValue - increment * sum(xRange * dPareto4(xRange, mu=mu, sigma=sigma, gamma=gamma, alpha=alpha, right=right))) / (1 - tail) }else{ xRange <- seq(qPareto4(tail, mu=mu, sigma=sigma, gamma=gamma, alpha=alpha, right=right), mu, increment) eLoss <- (expectedValue - increment * sum(xRange * dPareto4(xRange, mu=mu, sigma=sigma, gamma=gamma, alpha=alpha, right=right))) / tail } return(eLoss) }

Following is a script that does the same for a Gaussian, but here we have to make a substitution for +/- infinity, I use 1e-5 which is more than enough for a couple of significant digits:

shortFallGaussian <- function(tail=0.01, increment=0.001, mean=0, sd=1, right=F){ kay <- qnorm(p=ifelse(right, 1 - tail , tail), mean, sd) eLoss <- mean + ifelse(right, 1, -1) * sd * exp(-(kay - mean)^2 / 2 / sd^2) / sqrt(2 * pi) / tail return(eLoss) }

## The Script That Runs The Experiment

Following is a script that ties it all together, calculates all the results and generates all the charts above:

Note: See linked post for improved version

Playing with the various parameter values in the script is a good way to get a feel for what is important and what is not!

EDIT 04-05-2016: Made a correction to vPareto4() function and Silent Risk C3.R script. See the next post for the resolution of an issue with calculating expected loss and an improved version of the script.

## Share

If you found this post informative, please share it!