A constant source of nagging doubt in system design is the use of parametric probability density functions. For the most part it is better to use empirical distributions i.e. use the data itself to determine the liklihood of a given future occurrance. Of course, this doesn’t protect you from the Black Swan which, by definition, is an event that is NOT in the historical data. It does, at least, give you some handle on the fat tails.

R has a couple of nice functions, “quantile” and “ecdf” (empirical cumulative distribution function) that are invaluable if you take this approach. Following is a quick run-down of quantile and ecdf and a couple of ways of employing them.

## Quantile Function

quantile(x, probs = seq(0, 1, 0.25), na.rm = FALSE, names = TRUE, type = 7, …)

- “x” is the data you want to analyse – a vector of values
- “probs” is the how you want to analyze x – the default gives max, min and quartiles
- “names” tells the function to give names to the elements in the returned vector – set it to FALSE for speed
- “type” takes a value from 1-9 and specifies how to calculate the percentile value for each item in x and how to calculate the value of x for each value in “probs”

Following are some notes on type:

If x takes discrete values, use one of types 1-3, otherwise use one of types 4-9. Picture the process as simply sorting the list of values in ascending order, and counting along the list to get the percentile required.

- Type 1 is simple: take the first value such that the count equals or exceeds the proportion required. e.g. the 25th percentile of 20 items is the 5th in a sorted list: 0.25 x 20 = 5. It would be the 6th if there were 21 items: ceiling(0.25 x 21) = 6.
- Type 2 is the same except you average an item with the next in the list if you “land” on a value: 0.25 x 20 = 5 => average the 5th and 6th.
- Type 3 is trickier: if you land exactly mid way between two values, you use the nearest even-ranked value. e.g. 0.25 x 21 = 5.25, use 5; 0.25 x 22 = 5.5, use 6; …. 0.25 x 26 = 6.5, use 6; 0.25 x 27 = 6.75, use 7.

In my case, I am looking at price ranges vs a reference point over a fixed number of days – a continuous variable. But which of types 4 – 9 should I use? Method 5 is industry-specific, so let’s dispense with that!

Next, if you are using a large number of data points it does not really matter which type you use. The percentile values assigned to a data-point’s ranking, k, in a sample of size n is as follows:

- Type 4: k/n
- Type 6: k / (n+1)
- Type 7: (k-1) / (n-1)
- Type 8: (k – 1/3) / (n + 1/3)
- Type 9: (k – 3/8) / (n + 1/4)

If k and n are large, the values will be close. Even for large n, when k is small there will be large discrepencies – but how often are you in the 0.1th percentile? 1/1000th of the time! Occurrances in the 10th or 90th percentiles should probably be handled on an “exception” basis anyway. Type 4 always assigns the highest percentrank (7 is the lowest at low k, 6 at high k) – absolute difference between them will never exceed 1%.

Why are they different? Generally each is unbiased wrt a particular statistic. 6 is unbiased wrt the mean of the estimate of the quantiles, 7 for the mode, 8 for the median and 9 produces unbiased estimators of the ranking statistics if the underlying distribution is normal.

Since I am using this in conjunction with the empirical distribution function, I use type 4 which is a linear interpolation of the empirical distribution.

## ECDF Function

It took me a while to understand that the result of passing a data set to ecdf is a function.

ecdf(x)

“x” is a vector of data for which the empirical cumulative distribution function is required. Essentially, it creates a step function to which you can pass variables. The function returns the probablity of that value’s occurring given the set of data “x”.

For example, passing the data x <- 1:10 to ecdf and passing “3” to the resulting function: ecdf(x)(3) will return 30% as the result. There is a 30% chance of a value of 3 or less occurring based on a sample population {1,2,3,4,5,6,7,8,9,10}.

Another way of showing this that may be clearer is as follows:

x <- 1:10

fn <- ecdf(x)

fn(3)

0.30

## Chisq Testing

We can combine the quantiles function and the ecdf fuction to allow us to compare two empirical distributions via a chisq test. In my case I want to determine if a sub-set of a time series differs significantly from the entire series – I would rather analyze the whole series than introduce a look-back parameter.

Run a quantile analysis of the entire time series, bearing in mind the size of the sub-set and the requirement for a count of at least 5, and preferably 10, in each bucket of the distribution. For instance, a sub-set of 200 should work with percentiles divided into buckets 5% wide (5, 10, 15, etc):

percentiles <- seq(0.05, 1, by=0.05)

expected.distribution <- quantile(population, probs=percentiles, names=F, type=4)

population contains the data for the entire population

expected.distribution is a vector of the values (or breaks) defining each of the percentiles. e.g. 5% of values selected at random from this population should be less than or equal to the first value, 10% <= the second, etc.

subset.size <- length(observations)

observed.distribution<- ecdf(observations)(expected.distribution)

observed.counts<- subset.size * c(observed.distribution[1],

observed.distribution[-1] – observed.distribution[-length(observed.distribution)])

By passing the breaks from our quantile analysis of the population to the ecdf function generated from the observed data, ecdf(observations), we get back a vector of the cumulative distribution of the breaks. In a perfect world they would match the original percentiles. i.e. the expectation that a random value in the sample is less than or equal to the 5th percentile break in the population distribution should be 5% if the distributions are identical.

We figure the observed counts in each bucket by converting the cumulative density function to a probability density function by subtraction – we have to add back in the first “bucket” by concatenation, then multiply by the total sample size.

expected.counts <- rep(subset.size / (length(percentiles) – 1), length(percentiles) – 1)

chisq.pstat <- chisq.test(observed.counts, p=expected.counts, rescale.p=T)[[3]]

Because our original distribution is in the form of percentiles each bucket has the same expected count, thus we can easily calculate the p-statistic for the chisq test. And there you have it.

## Share

If you found this post informative, please share it!