I hate getting a result I don’t expect, but I like to cross-check every calculation I do, especially when I am relying on the calculation procedures contained in an R package. This is a recipe for “agita”.

I was recently exploring R’s autocorrelation function, acf(x, lag.max), applying it to the returns series for the last 5 years of Winton’s diversified fund (try it, you might find it interesting). I cross-checked the results using R’s correlation function, cor(x, y), with y set to a lagged version of x, i.e. y(t) = x(t - lag). I could not get the results to match; they were close but not exactly the same.
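Here is a minimal sketch of that cross-check in R. The Winton returns are not reproduced here, so `x` is a simulated stand-in (an assumption), and the lag `k` is an arbitrary choice for illustration.

```r
# Simulated stand-in for a monthly returns series (assumption: NOT the Winton data)
set.seed(42)
x <- rnorm(60, mean = 0.005, sd = 0.03)
n <- length(x)
k <- 1  # lag to compare (arbitrary)

# Autocorrelation at lag k as reported by acf(); element 1 of $acf is lag 0
acf_k <- acf(x, lag.max = k, plot = FALSE)$acf[k + 1]

# Pearson correlation between the series and its lagged copy: y(t) = x(t - k)
cor_k <- cor(x[(k + 1):n], x[1:(n - k)])

print(c(acf = acf_k, cor = cor_k))
```

On a short series like this the two numbers should come out close but not identical, which is the mismatch described above.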

The R documentation cites “Venables, W. N. and Ripley, B. D. (2002) Modern Applied Statistics with S. Fourth Edition. Springer-Verlag.” as the source for the code, but I don’t have it. So I had to do some sleuthing…

The standard formula for figuring the correlation coefficient of two series:

$latex \rho_{x, y} = \tfrac{1}{(n - 1)} \sum_{i = 1}^{n} \tfrac{(x_{i} - \overline{x}).(y_{i} - \overline{y})}{\sigma_{x}.\sigma_{y}} $
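As a quick sanity check, the formula can be typed out directly and compared with cor(). This is only a sketch: `x` and `y` are made-up simulated series, and sd() is R's sample standard deviation (divisor n - 1), which is what the 1/(n - 1) factor assumes.

```r
# Two made-up series for illustration (assumption: simulated data)
set.seed(1)
x <- rnorm(50)
y <- 0.5 * x + rnorm(50)
n <- length(x)

# The formula above: 1/(n - 1) prefactor with sample standard deviations
rho_manual <- sum((x - mean(x)) * (y - mean(y))) / ((n - 1) * sd(x) * sd(y))

# Compare with the built-in, allowing for floating-point error
print(all.equal(rho_manual, cor(x, y)))
```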

So what are the possible ways in which this formula could be different when investigating the correlation of a series with itself, lagged?

Let’s start by changing x_{i} and y_{i} to x_{i} and x_{(i - k)} to make it obvious the values are all drawn from the SAME series. We also have to change the range of the summation, and therefore the factor preceding it: the sum now runs over only (n - k) overlapping pairs, so 1/(n - 1) becomes 1/(n - k). As long as the standard deviations of the two sub-series use that same divisor, it cancels and the coefficient itself is unaffected:

$latex \rho_{k} = \tfrac{1}{(n - k)} \sum_{i = 1 + k}^{n} \tfrac{(x_{i} - \overline{x1}).(x_{i - k} - \overline{xk})}{\sigma_{x1}.\sigma_{xk}} $

where x1 and xk are the original series minus the first k terms and the original series minus the last k terms, respectively. $latex \rho_{k} $ is shorthand for the autocorrelation coefficient at lag k.
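The lagged formula above can be sketched in R as follows. The data are simulated and the helper `sd_m` is a hypothetical name introduced here. It computes a standard deviation with the same divisor as the prefactor, so that the divisor cancels and the result can be compared with cor() on the two sub-series.

```r
# Simulated series and an arbitrary lag (assumptions for illustration)
set.seed(7)
x <- rnorm(60)
n <- length(x)
k <- 3

x1 <- x[(k + 1):n]   # original series minus the first k terms
xk <- x[1:(n - k)]   # original series minus the last k terms
m  <- length(x1)     # number of overlapping pairs, n - k

# Hypothetical helper: standard deviation with divisor m, matching the prefactor
sd_m <- function(v) sqrt(mean((v - mean(v))^2))

# Sub-series means and standard deviations, as in the formula above
rho_k <- sum((x1 - mean(x1)) * (xk - mean(xk))) / (m * sd_m(x1) * sd_m(xk))

# Same quantity via the built-in correlation of the two sub-series
print(all.equal(rho_k, cor(x1, xk)))
```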

The first thing that jumps out is that we are using different means and standard deviations for x1 and xk, but they are sub-sets of the same series – so we ought to be able to use the mean and standard deviation of the entire series of x for BOTH. This takes us to:

$latex \rho_{k} = \tfrac{1}{(n - k)} \sum_{i = 1 + k}^{n} \tfrac{(x_{i} - \overline{x}).(x_{i - k} - \overline{x})}{\sigma_{x}^{\prime 2}} $

I used $latex \sigma_{x}^{\prime} $ to indicate we have to make a correction to sigma for the length of the series. $latex \sigma_{x} $ is the standard deviation of the full series, computed over all n values with divisor (n - 1), but the sum above covers only (n - k) pairs and is divided by (n - k). Rescaling the full-series sum of squared deviations to that divisor gives:

$latex \sigma_{x}^{\prime} = \sqrt{ \tfrac{ \left ( n - 1 \right )}{\left ( n - k \right )}}.\sigma_{x} $

Which we can substitute into our calculation to get:

$latex \rho_{k} = \tfrac{1}{(n - 1)} \sum_{i = 1 + k}^{n} \tfrac{(x_{i} - \overline{x}).(x_{i - k} - \overline{x})}{\sigma_{x}^{2}} $

The results of this formula match the results of the acf(x, lag.max) function!
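For anyone who wants to reproduce the match, here is a short sketch. The series is simulated (an assumption, not the Winton data) and `lag_max` is an arbitrary choice. It evaluates the final formula for lags 0 through `lag_max` and compares the values with what acf() returns.

```r
# Simulated stand-in for the returns series (assumption)
set.seed(42)
x <- rnorm(60, mean = 0.005, sd = 0.03)
n <- length(x)
lag_max <- 10
xbar <- mean(x)

# Final formula: full-series mean and variance, 1/(n - 1) prefactor,
# summed over the overlapping pairs at each lag k
rho <- sapply(0:lag_max, function(k) {
  sum((x[(k + 1):n] - xbar) * (x[1:(n - k)] - xbar)) / ((n - 1) * var(x))
})

# Values from acf(); the first element corresponds to lag 0
acf_vals <- as.numeric(acf(x, lag.max = lag_max, plot = FALSE)$acf)

print(all.equal(rho, acf_vals))
```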

I understand the differences and they seem reasonable. Agita gone away. Whew!
