I thought I would share some of the basics of how to detect change in time series in a couple or three posts. I have been working on the math behind my manager-monitoring services with the objective of making them more rigorous. Rather than just looking at the statistics and spotting anomalies, I want to be able to pick them out objectively. First I will review off-line change detection for a single change, then look at multiple changes and finally extend that to on-line monitoring.

Off-line monitoring means we have the entire data series and we are looking for the point or points in the history where some aspect of the data (such as the mean or variance) changed in a statistically significant way. On-line monitoring is similar but is carried out in real-time as the data arrives. For example: Based on today’s return data, can we now conclude, based on the preponderance of the evidence, that a manager’s performance has changed compared to some baseline?

In this post I will start out with a statement of the problem illustrated by the historical performance of Winton’s Diversified Program. I will give some background with a quick refresher on Maximum Likelihood Estimation (MLE) and its extension to the Generalized Likelihood Ratio (GLR). I am going to explore the problem of detecting a single change at an unknown time in the mean and/or standard deviation of a Gaussian series. I will show how using R’s probability density functions make the math a lot easier.

## Statement of the Problem

The chart to the right illustrates the problem: we are tracking the Winton Diversified Fund’s (WDF) performance. By eye we can see that the left side of the chart is different from the right: the scatter is much reduced.

But can we quantify what has happened using some simple measures? The fast moving average return plot is too noisy to tell much, while the slow moving average indicates a drop in rate of return but it has a huge lag. The standard deviation plot appears to rise into 2001 then fall from then on.

The change detection problem is to identify when and by how much the characteristics of WDF’s return series changed to a **significant** degree.

## Maximum Likelihood Estimation (MLE)

The statistician R A Fisher championed the use of “likelihood” for estimating the parameters for statistical models of the world. Likelihood is subtly different from probability: probability is about the data given a statistical model, whereas likelihood is about the model given the data. Fisher introduced the term “likelihood” to emphasize this difference (Fisher, R.A. (1921) “On the ‘probable error’ of a coefficient of correlation deduced from a small sample.” Metron 1 pt 4, 3-32). In the real world, for the most part, we have the data, it is the model that is the unknown, so likelihood is what we need to work with.

Fisher’s basic idea was that the likelihood of a set of assumptions about a model (parameter values, or even the choice of model itself, H) given the data (e – evidence) is proportional to the probability of getting the data given the model.

$latex P(H \parallel e) \propto P(e \mid H) $

The parallel bar is to emphasize that the hypothesis is not conditioned on the evidence in the traditional probability sense. You should be able to see the relationship to Bayes’ theorem:

$latex P(H \mid e) = \frac{P(H).P(e \mid H)}{P(e)} $

We don’t know the probability of getting the data we have, but since we will use the same data to evaluate various hypotheses, P(e) will be a constant. Taking a Bayesian approach we can argue that we choose any of our hypotheses with equal probability, thus P(H) is also a constant. Thus, P(H)/P(e) is a constant of proportionality, and we get Fisher’s likelihood relationship.

To find the model that has the strongest explanatory power given the data, we simply search the parameter space until we have found the set of parameters with the maximum likelihood.

## Generalized Likelihood Ratio (GLR)

In his 1972 book “Likelihood”, A W F Edwards stated his Likelihood Axiom:

Within the framework of a statistical model, all the information which the data provide concerning the relative merits of two hypotheses is contained in the likelihood ratio of those hypotheses on the data, and the likelihood ratio is to be interpreted as the degree to which the data support the one hypothesis against the other.

This tidily pulls together Fisher’s ideas telling us how to compare two or more competing hypotheses to explain the data: we divide the likelihood of one by the other to get the likelihood ratio. This ratio tells us the relative odds of the competing hypotheses being correct. If the answer is “2” then one hypothesis is twice as likely to be true as the other.

If we know something about the distribution of the likelihood ratio, we can derive a p-value and interpret its significance. Under the right circumstances (exponential distributions) the log of the square of the likelihood ratio has a chi-squared distribution with degrees of freedom (df) equal to the df of the numerator hypothesis minus the df of the denominator hypothesis.

Here’s how it works:

$latex \Lambda_{1}^{n}=\frac{P(H_{1}\parallel y_{i})}{P(H_{0}\parallel y_{i})}=\frac{P(y_{i}\mid H_{1,i})}{P(y_{i}\mid H_{0,i})} $

where yi is the data set indexed from 1 to n, and $latex \Lambda_{1}^{n}$ is the generalized likelihood ratio for the competing hypotheses H_{0,i} and H_{1,i}. The index attached to the hypotheses is to keep the approach as general as possible – I am permitting the hypotheses to change (hint, hint) during the time-series.

To calculate the joint probability for the numerator and for the denominator across the whole data set, y, we multiply the probability for each data point. It is customary to take logs of both sides of the equation. This avoids numerical over-runs from multiplying many likelihoods together, instead we add the logs. The results:

$latex ln(\Lambda_{1}^{n})=l_{1}^{n}=\sum_{1}^{n}ln(P(y_{i}\mid H_{1,i}))-\sum_{1}^{n}ln(P(y_{i}\mid H_{0,i})) $

This last line, the generalized log-likelihood ratio is the key to everything that follows. A final note, there seems to be no general agreement about whether H_{0} should be the numerator or the denominator. I have presented it the way that makes sense to me.

## A Single Change in a Series

Let’s set up some controlled data and see how this works on a time series. I am going to use a Gaussian distribution with a known mean and standard deviation that change at some random point during the series. So let’s set up a GLR algorithm to detect that change.

### The Algorithm

The null hypothesis says there is no change during the series, and the maximum likelihood estimate of the mean and standard deviation for the series are … the mean, $latex \tilde{\mu}$, and standard deviation, $latex \tilde{\sigma}$ (duh). The alternate hypothesis is that there is a change at time t0, and before that time the series had a mean $latex \mu_{0} $ and standard deviation $latex \sigma_{0}$. From time t0 onwards the series has a mean $latex \mu_{1}$ and standard deviation $latex \sigma_{1}$. Just as for the full series, the MLE of the means and sd’s for each sub-set of the series are simply the mean and standard deviation of that bit of the series. Substituting this into the log-likelihood from above, we get:

$latex l_{1}^{n}=\sum_{1}^{t_{0}-1}ln(P(y_{i}\mid \mu_{0}, \sigma_{0}))+\sum_{t_{0}}^{n}ln(P(y_{i}\mid \mu_{1}, \sigma_{1}))-\sum_{1}^{n}ln(P(y_{i}\mid \tilde{\mu}, \tilde{\sigma})) $

Notice that all I have done is split the first summation of logs into a 1 to (t0 – 1) section and a t0 to n section. This is the equivalent of evaluating the likelihood ratios for the first (t0 – 1) data points using $latex \mu_{0}$ and $latex \sigma_{0}$ vs. $latex \tilde{\mu}$ and $latex \tilde{\sigma}$ and then evaluating the last (n – t0 + 1) data points using $latex \mu_{1}$ and $latex \sigma_{1}$ vs. $latex \tilde{\mu}$ and $latex \tilde{\sigma}$.

So all I have to do is evaluate the log-likelihood ratio for each value of t0 and find the maximum value. That will tell me the most likely change time.

### Some Results

The following charts show the raw data, the mean and standard deviation used to generate the data and the GLR-derived estimates of mean and standard deviation. The first chart shows a small change in mean (0.008 to 0.012), the second a change in sd (0.05 to 0.10), the third a large change in mean (0.008 to 0.040) and the fourth a small change in mean (0.008 to 0.012) plus a change in sd (0.05 to 0.10). I have tried to use values that might reasonably be found in stream of returns.

Use the “Contact Us” form or the comments section to let me know if you would like the script I used to generate the charts – you can play around with most aspects of it to get a feel for how the algorithm responds.

### Some observations

- In these time series, I found the algorithm is better at detecting changes in sd than mean – in financial return series the sd’s of return tend to be large relative to the mean of returns. I guess the “noise” drowns out the signal. Fortunately for us, changes in mean return are often accompanied by changes in sd of returns, as the manager adjusts, for example, leverage. Even quite large changes in mean will go undetected.
- You may look at some of the plots and notice that the estimates are significantly different from the underlying parameters that generated the series. This is always the case with stochastic series: a sample mean or standard deviation can be quite far from the population mean and standard deviation. Randomness can be pretty random, you know!
- The numerator in the likelihood ratio (the positive parts of the sum of logs) will always be larger than the numerator as it has 4 degrees of freedom – 2 means and 2 sd’s, compared to the denominator’s 2 df’s. This means there will always be a candidate change point.
- There is no “significance” threshold in use. This means the algorithm returns the most likely location of a change even if there isn’t an actual change.
- There is a tendency to favor the beginning and end of the series as a likely change candidate because there is a small number of samples in one of the sub-sets e.g. points 1 to 4, vs points 5 to 100. So I added in a “margin” to the algorithm: I look for the most significant change outside of the margin (the first and last 12 data points in the case).
- When I set up the algorithm there were various alternatives I could have used. I could, for example, assume the mean is constant and
**only**test for change in standard deviation, or vice-versa. My thought is if the distribution has changed I don’t see how you can assume only part of it has changed – check for everything!

## Using R to Make it Easier

You have a choice. You can make life hard for yourself and substitute in the probability density function, simplify and then substitute in the dataset to calculate the log-likelihood, or you can just make use of R and let it calculate the densities using whatever probability distribution you want.

How do you make it hard for yourself? Let’s expand out the likelihood function assuming that we are dealing with a Gaussian dataset with a constant standard deviation and a change in mean from $latex \mu_{0}$ to $latex \mu_{1}$ at time $latex t_{0}$:

$latex l_{1}^{n}=ln(\frac{P(y_{1} \mid \mu_{0})…P(y_{t_{0}-1} \mid \mu_{0}).P(y_{t_{0}} \mid \mu_{1})…P(y_{n} \mid \mu_{1})}{P(y_{1} \mid \mu)…P(y_{t_{0}-1} \mid \mu).P(y_{t_{0}} \mid \mu)…P(y_{n} \mid \mu)})$

$latex =ln(P(y_{1} \mid \mu_{0}))+…+ln(P(y_{n} \mid \mu_{1}))-ln(P(y_{1} \mid \mu))-…-ln(P(y_{n} \mid \mu))) $

$latex P(y_{i} \mid \mu) = \frac{1}{ \sigma \sqrt{2 \pi}} exp(- \frac{(y_{i}- \mu)^{2}}{2 \sigma^{2}}) \Rightarrow $

$latex l_{1}^{n}=- \frac{(y_{1}- \mu_{0})^{2}}{2 \sigma^{2}}-…- \frac{(y_{n}- \mu_{1})^{2}}{2 \sigma^{2}}+ \frac{(y_{1}- \mu)^{2}}{2 \sigma^{2}}+…+ \frac{(y_{n}- \mu)^{2}}{2 \sigma^{2}} $

Multiply out the brackets and the y-squared terms cancel. Sum the terms and substitute in the following:

$latex \mu_{0}= \frac{1}{(t_{0}-1)} \Sigma_{1}^{t_{0}-1}y_{i}, \mu_{1}= \frac{1}{(n-t_{0}+1)} \Sigma_{t_{0}}^{n}y_{i}, \mu= \frac{1}{n} \Sigma_{1}^{n}y_{i} $

and we get,

$latex l_{1}^{n}= \frac{1}{2 \sigma^{2}} \left ( \frac{1}{(t_{0}-1)}( \Sigma_{1}^{t_{0}-1}y_{i})^{2}+ \frac{1}{(n-t_{0}+1)}( \Sigma_{t_{0}}^{n}y_{i})^{2}- \frac{1}{n}( \Sigma_{1}^{n}y_{i})^{2} \right ) $

So now we have made the arithmetic very easy.

Using R we have no need to go through this “textbook” process. We can use the dnorm() function to do all the heavy lifting:

yi is the data series

sigma.tilde <- sd(yi); mu.tilde <- mean(yi)

log.likelihood <- sum(log(dnorm(yi[1:(t.0 – 1)], mean(yi[1:(t.0 – 1)]), sigma.tilde)),

log(dnorm(yi[1:t.0], mean(yi[t.0:length(yi)]), sigma.tilde)),

-log(dnorm(yi, mu.tilde, sigma.tilde)))

Much easier on the brain! Obviously you can extend this to use any of the probability density functions you like, not just Gaussian. I am planning on investigating how to apply this technique using the empirical cumulative distribution function too. If you want to see how I implemented the algorithm in this post, use the contact form above and I can email the R script to you.

## Conclusion

We’ve covered some hard stuff (background math) and some easy stuff (off-line single change). In my next post I will look at multiple changes in the off-line setting and bring us back to the Winton Diversified Fund. I will also address the issue of testing for significance.

I have since added to this topic by discussing detecting multiple changes in time series.

## Share

If you found this post informative, please share it!