As part of my procedure for back-testing, I validate the data before using it. One of the validation steps is to check for unusually large daily changes in the raw close. Here I explore the extreme value distribution and its role in data validation protocols for systematic trading system design.

This post is part of a series including:

- Garbage In => Garbage Out
- Creating Good Price Series
- Data Validation
- This post
- Some additional protocols.

The changes on rollover days form a subset of daily changes; obviously they have the potential to differ substantially from regular daily changes.

Daily price changes are not normally distributed; typically prices are modeled as log-normal, i.e. the log price changes are treated as normal. Even so, in a series of 4,000 daily log price changes, there will often be several 4-sigma deviations, which under a normal distribution should happen less than 1 time in 10,000.
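As a quick check of the 1-in-10,000 figure, the two-sided tail probability of a 4-sigma move under a normal distribution can be computed with the Python standard library alone. This is only a sketch: the sample size of 4,000 comes from the text, and the function name is made up for illustration.

```python
import math

def two_sided_normal_tail(k: float) -> float:
    """P(|Z| > k) for a standard normal Z, via the complementary error function."""
    return math.erfc(k / math.sqrt(2.0))

p = two_sided_normal_tail(4.0)   # roughly 6.3e-5, i.e. about 1 in 15,800
expected = 4000 * p              # expected number of 4-sigma days in 4,000 observations
print(f"P(|Z| > 4) = {p:.3e}, expected count in 4,000 days = {expected:.2f}")
```

Under normality the expected count is well below one, so seeing several such days in a single series is a sign that the tails are fatter than the normal model allows.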

## Can the Generalized Extreme Value Distribution (GEV) help?

Wikipedia has a good page on this topic and there are several sets of lecture notes available on the web to help out.

This distribution is a model for the **extremes** in the data. For example, in a month of log price changes, there will be two extremes: a maximum and a minimum. Next month there will be another maximum and a minimum. These monthly maxima and minima form two new data sets. The analyst can attempt to fit the GEV to each subset of extrema.
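The block-extrema step described above can be sketched in a few lines of Python. This is an illustrative sketch, not the code behind this post: the function name `monthly_extremes` and the sample data are invented, and it assumes the log changes arrive as parallel lists of dates and values.

```python
from collections import defaultdict
from datetime import date

def monthly_extremes(dates, log_changes):
    """Return (maxima, minima): one value per calendar month, in chronological order."""
    by_month = defaultdict(list)
    for d, r in zip(dates, log_changes):
        by_month[(d.year, d.month)].append(r)
    months = sorted(by_month)
    maxima = [max(by_month[m]) for m in months]
    minima = [min(by_month[m]) for m in months]
    return maxima, minima

# Tiny made-up example: two months with two log changes each.
dates = [date(2009, 1, 5), date(2009, 1, 6), date(2009, 2, 2), date(2009, 2, 3)]
changes = [0.010, -0.020, 0.005, 0.030]
maxima, minima = monthly_extremes(dates, changes)
print(maxima, minima)
```

Each of the two returned lists is then a candidate data set for a GEV fit.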

The cumulative distribution function:

$latex F(x; \mu, \sigma, \xi)=e^{-t(x)} $

Where:

$latex t \left ( x \right ) = \left \{ \begin{matrix} \left( 1 + \xi \left( x - \mu \right) / \sigma \right)^{\left( -1/\xi \right)} & \xi \neq 0 \\ e^{- \left( x - \mu \right) / \sigma} & \xi = 0 \end{matrix} \right. $

μ is the “location” parameter – (**not** the mean)

σ > 0 is the “scale” parameter – (**not** the standard deviation)

ξ is the “shape” parameter

and

$latex 1 + \xi \frac{\left ( x - \mu \right )}{\sigma } > 0 $
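For readers who prefer code to formulas, here is a minimal Python sketch of the cumulative distribution function, including the support condition. The function name `gev_cdf` is my own choice, and working with log t(x) instead of t(x) is an implementation detail added here to avoid floating-point overflow; it is not part of the definition.

```python
import math

def gev_cdf(x, mu, sigma, xi):
    """GEV cumulative distribution F(x) = exp(-t(x)) for maxima."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    if xi == 0.0:
        log_t = -z                       # Gumbel case: t = exp(-z)
    else:
        s = 1.0 + xi * z
        if s <= 0.0:
            # Outside the support: below the lower bound (xi > 0)
            # or above the upper bound (xi < 0).
            return 0.0 if xi > 0 else 1.0
        log_t = -math.log(s) / xi        # t = s^(-1/xi)
    if log_t > 700.0:
        return 0.0                       # t is huge, so exp(-t) underflows to zero
    return math.exp(-math.exp(log_t))

# At x = mu, t = 1, so F(mu) = exp(-1) whatever the shape parameter.
for xi in (-0.3, 0.0, 0.3):
    print(xi, gev_cdf(0.0, 0.0, 1.0, xi))
```

The loop is a quick sanity check: at the location parameter, roughly 36.8% of the probability mass lies to the left, regardless of ξ.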

The shape parameter ( ξ = 0, ξ > 0, ξ < 0 ) divides up the GEV into “families” of distributions (Gumbel, Frechet, Weibull). The choice is driven by the shape of the underlying distributions of the data from which the extrema were derived. Consider, for example, mean time between failures in electronic devices which tend to fail early or not at all versus failure rates in mechanical devices which increase over time as a result of fatigue.

I put together a spreadsheet to noodle around with the distributions to see how they work – I will make it available upon request. To get a feel for the distribution it is most informative to look at the probability density function:

$latex f \left ( x; \mu, \sigma, \xi \right ) = t \left ( x \right )^{\left ( \xi + 1 \right )}e^{-t \left ( x \right )} / \sigma $
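The density can be sketched in Python in the same way. Again, `gev_pdf` is a name invented for illustration, and the log-space arithmetic is a numerical precaution added here, not part of the formula. The two checks at the end are there only to build confidence in the sketch: the value at x = μ, and a crude numerical integration.

```python
import math

def gev_pdf(x, mu, sigma, xi):
    """GEV density f(x) = t(x)^(xi + 1) * exp(-t(x)) / sigma, for maxima."""
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    z = (x - mu) / sigma
    if xi == 0.0:
        log_t = -z                       # Gumbel case: t = exp(-z)
    else:
        s = 1.0 + xi * z
        if s <= 0.0:
            return 0.0                   # outside the support
        log_t = -math.log(s) / xi        # t = s^(-1/xi)
    if log_t > 700.0:
        return 0.0                       # exp(-t) underflows to zero
    log_f = (xi + 1.0) * log_t - math.exp(log_t) - math.log(sigma)
    return math.exp(log_f) if log_f < 700.0 else math.inf

# At x = mu, t = 1, so the density is exp(-1) / sigma whatever the shape.
for xi in (-0.3, 0.0, 0.3):
    print(xi, gev_pdf(0.0, 0.0, 2.0, xi))

# Crude midpoint-rule check that the Gumbel density integrates to about 1
# over the interval [-10, 40].
step = 0.01
area = sum(gev_pdf(-10.0 + (i + 0.5) * step, 0.0, 1.0, 0.0) for i in range(5000)) * step
print(area)
```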

The following are some charts produced in R using the evd package, examining a distribution for maxima:

## Gumbel Distribution I

This is the Gumbel distribution (xi = 0) varying the location parameter. The distribution simply moves left or right without changing in any other way. For a distribution of minima, imagine a mirror image about the location parameter. Note that the distribution has a positive (right) skew.

## Gumbel Distribution II

This is also the Gumbel distribution, but with the scale parameter varied. The mode of the pdf is unchanged. The larger the scale, the more spread out along the x-axis the pdf becomes. The shape of the pdf is otherwise unchanged; it is simply stretched or compressed.

## Weibull Distribution

This is a Weibull distribution (xi < 0). As xi becomes more negative, the pdf goes from skewed right to skewed left. All the pdfs have the same value when x = μ: there t(x) = 1, so the density is e^(-1)/σ whatever the shape.

This is the type of distribution found with fatigue-type failures in physical systems. It is also what I found when examining monthly maxima and minima in log daily price changes (xi in the range 0 to -.5).

## Frechet Distribution

Finally, this is the Frechet distribution (xi > 0). All the pdfs have the same value when x = μ. A point of interest: for xi >= 1 the distribution has no finite mean, and for xi >= 1/2 it has no finite variance!

This is the type of distribution found in failures in electronic components.

So how is this all used in data validation? I create a data set of the monthly maxima and minima for each price series (approx. 15 years × 12 months/year gives 180 maxima and 180 minima for each series). For each price series, I use the evd package (function: fgev) to fit the extreme value distribution to the data, i.e. estimate the three parameters μ, σ, ξ for the distribution of maxima and for the distribution of minima. I then calculate, for every log daily price change, the probability that it could have come from this distribution using the cumulative distribution function (function: pgev; dgev returns the density, not the cumulative probability). Finally, I flag any values that should have occurred less than once in a data set of the size used to generate the distribution (if there were 180 maxima in the original data, I look for P(maximum >= x) < 1/180).
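The flagging rule can be sketched as follows. This is a Python illustration under stated assumptions, not the R workflow described above: the fit itself is skipped, the parameter values (μ = 0.02, σ = 0.008, ξ = -0.1) are invented stand-ins for fitted values, and the function names are hypothetical.

```python
import math

def gev_exceedance(x, mu, sigma, xi):
    """P(X > x) = 1 - exp(-t(x)) for a GEV fitted to maxima."""
    z = (x - mu) / sigma
    if xi == 0.0:
        log_t = -z
    else:
        s = 1.0 + xi * z
        if s <= 0.0:
            # Below the lower bound (xi > 0) or above the upper bound (xi < 0).
            return 1.0 if xi > 0 else 0.0
        log_t = -math.log(s) / xi
    if log_t > 700.0:
        return 1.0                            # t is huge, so exp(-t) is effectively zero
    return -math.expm1(-math.exp(log_t))      # 1 - exp(-t), accurate for tiny t

def flag_extreme_highs(log_changes, mu, sigma, xi, n_maxima):
    """Flag changes whose exceedance probability is below 1 / n_maxima."""
    threshold = 1.0 / n_maxima
    flagged = []
    for i, x in enumerate(log_changes):
        p_exceed = gev_exceedance(x, mu, sigma, xi)
        if p_exceed < threshold:
            flagged.append((i, x, p_exceed))
    return flagged

# Made-up parameters standing in for a fit to 180 monthly maxima.
changes = [0.001, -0.004, 0.015, 0.04, 0.06]
flagged = flag_extreme_highs(changes, mu=0.02, sigma=0.008, xi=-0.1, n_maxima=180)
for i, x, p in flagged:
    print(f"index {i}: change {x:+.3f}, P(max >= x) = {p:.5f}")
```

With these invented numbers only the last change (0.06) falls below the 1/180 threshold. One way to handle minima, though not necessarily the one used here, is to negate the series and apply the same maxima machinery.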

Unfortunately, there is no way to apply a similar approach to rollovers, as there are too few data points. This brings up an interesting issue: how big a sample should I draw my maxima and minima from? Is monthly OK (samples of 20 or so) or can I go even smaller? How about 1?

The end result: I flag far fewer data points, at much greater extremes, than when using the normal distribution. This means I am more likely to catch a real problem.

Final comment: I am hopeful I can apply this approach to other aspects of system design such as extremes in length of streaks, size of wins and losses, equity draw-downs, etc.

Rather interesting post.

I occasionally google unusual search items (Kalman Filter, gumbel distribution, etc.) in Google's blog search. You were the first hit for gumbel distribution!