The Foe: Curve-Fitting
My biggest fear is being fooled by randomness: curve-fitting is the enemy hidden in plain sight. I have done a lot of system development research and testing over the last month, and a large part of that work has involved searching for robust parameter sets. I have been thinking about good ways to test how far an optimal parameter set is a product of curve-fitting. Some form of cross-validation seems like the way to go. First I will briefly describe the concept, then review some good and bad ideas on how to implement the process.
A system trader aims to build a model that determines a course of action for each instrument at each bar. The optimization process involves tuning the model to deliver the highest CAGR subject to constraints. The present problem can be framed as follows: the system is optimized on the data we have up to this point in time, but will it continue to perform in the future? Can we use the data we have in a way that gives us as large a sample as possible for optimization, and also gives us some measure of how the system will perform on new data?
The Ally: Cross-Validation
One answer is to use some form of cross-validation. The general principle of cross-validation is to divide up a dataset such that one set is used for analysis (training data) and the other is used to test the success of the analysis (validation data). The different methods of subdividing the data fall into two categories:
- Repeated Sub-Sampling – Create one random partition from the dataset and hold it aside. Optimize using the remaining data. Validate on the partition. Repeat as often as you want. Average the results.
- K-Fold Sampling – Split the data into K random partitions. Hold aside one partition, optimize using the remainder, and validate on the held-out partition. Repeat K times so that each partition is held out once. Average the results.
Clearly these techniques, while simple in principle, require a number of decisions. Most obviously: how large should the partitions be, and should they be contiguous? There are also more subtle decisions, such as whether the sub-sampling should preserve some essential characteristic(s) of the data set (stratification).
The first method allows an extra degree of freedom in that the number of tests (“folds”) is independent of the partition size, but some of the data may never be used in a validation set and some may never be used in a training set. In the second method the partition size is a function of the number of folds (each partition is 100 / K percent of the data), but every data point gets equal use: once for validation and K - 1 times for training.
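The difference in coverage between the two methods can be sketched in a few lines. This is only an illustration under my own assumptions: the function names, the seeds, and the choice of 40 items with a 20% holdout are hypothetical, and the items are plain indices rather than real price data.

```python
import random

def repeated_subsampling(n_items, holdout_fraction, n_repeats, seed=0):
    """Yield (train, validation) index lists; each repeat draws a fresh random holdout."""
    rng = random.Random(seed)
    n_holdout = round(n_items * holdout_fraction)
    for _ in range(n_repeats):
        validation = sorted(rng.sample(range(n_items), n_holdout))
        held = set(validation)
        train = [i for i in range(n_items) if i not in held]
        yield train, validation

def k_fold(n_items, k, seed=0):
    """Yield (train, validation) index lists; each item is held out exactly once."""
    rng = random.Random(seed)
    order = list(range(n_items))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]  # k near-equal random partitions
    for fold in folds:
        held = set(fold)
        train = [i for i in range(n_items) if i not in held]
        yield train, sorted(fold)

def validation_counts(splits, n_items):
    """Count how many times each item appears in a validation set."""
    counts = [0] * n_items
    for _, validation in splits:
        for i in validation:
            counts[i] += 1
    return counts

kfold_counts = validation_counts(k_fold(40, 5), 40)
subsample_counts = validation_counts(repeated_subsampling(40, 0.2, 5), 40)
print(kfold_counts)      # every item is validated exactly once
print(subsample_counts)  # counts vary by item; some items may never be validated
```

With repeated sub-sampling the number of repeats can be raised freely without changing the holdout size, which is the extra degree of freedom mentioned above.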
The most important assumption behind the technique is that the partitions are randomly selected from the same population. This has implications for how the data is partitioned.
The results of the process can be used in three ways:
- To estimate the performance of the model when applied to “unseen” data
- To compare the performance of one model vs another structurally different one.
- To compare the performance of one set of model parameters with another set of parameters for the same model.
Consider a dataset made up of the daily OHLC of 40 instruments over 20 years, about 200,000 sets of OHLC. How might that data be partitioned?
- Training data: years 1 – 16; Validation data: years 17 – 20. Train on 80% and validate on 20%. The partitioning is not random, therefore this is inappropriate. In addition, there is only one training set and one validation set.
- Randomly partition 4 years of data as a validation set. This is better, but can be improved. Will the partition be 4 contiguous years? If so, then the years at either end (1-3 and 18-20) fall inside fewer of the possible 4-year windows than the middle years do, so they will be represented in a biased way.
- There are 40 x 20 = 800 instrument-years. Randomly partition 160 instrument-years as a validation set. This seems like it meets the requirements, but what if there are partial years?
- Why use years? Why not weeks? Sample from 40 x 20 x 52 = 41,600 instrument weeks and make partitions of 8,320 instrument weeks.
- Randomly partition 8 instruments as a validation set. This seems like it would work if one assumes that all instruments behave in essentially the same way (i.e. the various price series are drawn from the same population of price series).
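Two of the options above can be checked with a quick calculation. The sketch below is illustrative only: it assumes the validation window's start year is drawn uniformly, uses integer placeholders for instruments, and the constant names and seed are my own.

```python
import random
from collections import Counter

N_INSTRUMENTS, N_YEARS, HOLDOUT = 40, 20, 0.20
WINDOW = 4  # length in years of a contiguous validation block

# (a) Contiguous block: count how many of the possible 4-year windows
#     contain each calendar year (start years run from 1 to 17).
coverage = Counter()
for start in range(1, N_YEARS - WINDOW + 2):
    for year in range(start, start + WINDOW):
        coverage[year] += 1
print([coverage[y] for y in range(1, N_YEARS + 1)])
# years near either end fall inside fewer windows than the middle years

# (b) Instrument-years: sample 20% of all (instrument, year) pairs.
units = [(inst, year)
         for inst in range(N_INSTRUMENTS)
         for year in range(1, N_YEARS + 1)]
rng = random.Random(42)
n_validation = round(len(units) * HOLDOUT)
validation = set(rng.sample(units, n_validation))
training = [u for u in units if u not in validation]
print(len(units), len(validation), len(training))  # 800 160 640
```

The same pattern extends to instrument-weeks by replacing the year index with a week index.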
In the context of system design, with the implicit assumption that price series are not random, there are key differences compared to the scientific modeling scenario, which assumes independent observations. The longer the system’s average holding period, the more important this issue becomes. Consider partitioning by instrument and by time:
Suppose that, for a given randomly selected instrument, 2005-01-01 through 2005-12-31 has been randomly selected as the validation partition. A particular back-test has an open position at 2004-12-31. Should this position be closed at the open on Jan 1? The back-test then resumes at 2006-01-01, but a trading system with an appreciable look-back (e.g. a 200 DMA) needs priming data. How should the look-back be handled? Obviously, data from the validation partition cannot be used. Perhaps the training dataset can be spliced together into a new continuous contract, effectively “rolling over” the gaps created by removing the validation dataset. This solves both problems in that, as far as the optimization process is concerned, the validation partition no longer exists.
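One way the splice might look, as a minimal sketch rather than a tested procedure: remove the validation bars, then ratio-adjust the later bars so that each remaining bar keeps its own one-day return, much as a back-adjusted continuous contract does at a rollover. The helper name `splice_out` and the sample closes are hypothetical, and only closes are handled (a full version would scale the open, high and low by the same ratio).

```python
from datetime import date

def splice_out(bars, start, end):
    """Drop bars dated start..end inclusive and ratio-adjust the later bars.

    bars is a date-sorted list of (date, close). After the splice, the return
    from the last training bar before the gap to the first bar after it equals
    that first bar's original one-day return.
    """
    before = [(d, c) for d, c in bars if d < start]
    removed = [(d, c) for d, c in bars if start <= d <= end]
    after = [(d, c) for d, c in bars if d > end]
    if not before or not removed or not after:
        return before + after  # nothing to join across
    ratio = before[-1][1] / removed[-1][1]
    return before + [(d, c * ratio) for d, c in after]

# Made-up closes purely for illustration.
bars = [
    (date(2004, 12, 30), 100.0),
    (date(2004, 12, 31), 102.0),
    (date(2005, 1, 3), 104.0),    # validation partition starts
    (date(2005, 12, 30), 110.0),  # validation partition ends
    (date(2006, 1, 2), 121.0),
    (date(2006, 1, 3), 120.0),
]
spliced = splice_out(bars, date(2005, 1, 1), date(2005, 12, 31))
for d, close in spliced:
    print(d, round(close, 2))
```

The dates still show a gap, so the optimizer would need to treat the bars as consecutive. Ratio adjustment changes price levels after the gap, which matters for any rule that depends on absolute price.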
Next, the validation datasets and the back-tests using them have to be managed – should they be spliced together? What about priming? Perhaps the validation set can be spliced to the priming period used by the training set.
This seems like a horrendous task!
I feel that the simplest way to do this, if your system uses a sufficiently large portfolio, is to partition randomly by instrument. Consider my 39-instrument portfolio. If I want to hold out 20% of the data, I can randomly select 8 instruments (8 / 39 is about 20.5%) as my validation set.
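Extending this to K folds is straightforward. Here is a rough sketch for 5 folds, assuming placeholder symbols (`INST01` to `INST39`) in place of my actual portfolio and an arbitrary seed:

```python
import random

# 39 placeholder symbols; these are hypothetical names, not a real portfolio.
instruments = [f"INST{i:02d}" for i in range(1, 40)]

rng = random.Random(7)
shuffled = instruments[:]
rng.shuffle(shuffled)

K = 5
folds = [shuffled[i::K] for i in range(K)]  # fold sizes 8, 8, 8, 8, 7

for n, validation in enumerate(folds, start=1):
    held = set(validation)
    training = [s for s in instruments if s not in held]
    # Optimize on `training`, then back-test the optimal parameters on `validation`.
    print(n, len(training), len(validation))
```

Because 39 does not divide evenly by 5, one fold holds 7 instruments rather than 8.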
Once the process is complete (and it could be very resource intensive to train numerous times), the results have to be evaluated. In the scientific model context, testing is easy: take the model parameters derived from the training set and use them to make predictions on the validation set. Some measure of the accuracy of the predictions, such as MSE, is calculated and averaged across the K folds. This gives us a measure of the ability of this model to be trained on one dataset and make predictions about another. If we have two candidate models, the one with the best average accuracy is considered the best model.
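For the scientific-model case, the evaluation loop might look like the sketch below. Everything in it is an assumption made for illustration: the model is a simple least-squares line, the data is synthetic (a known line plus Gaussian noise), and the function names are my own.

```python
import random

def fit_line(points):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in points)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in points)
    b = sxy / sxx
    return mean_y - b * mean_x, b

def mse(points, a, b):
    """Mean squared prediction error of the line on the given points."""
    return sum((y - (a + b * x)) ** 2 for x, y in points) / len(points)

def k_fold_mse(points, k, seed=0):
    """Train on K-1 folds, score on the held-out fold, average over K folds."""
    rng = random.Random(seed)
    shuffled = points[:]
    rng.shuffle(shuffled)
    folds = [shuffled[i::k] for i in range(k)]
    scores = []
    for i, validation in enumerate(folds):
        training = [p for j, fold in enumerate(folds) if j != i for p in fold]
        a, b = fit_line(training)
        scores.append(mse(validation, a, b))
    return sum(scores) / k, scores

# Synthetic data: y = 1 + 2x plus noise. Not market data.
noise = random.Random(1)
data = [(x, 1.0 + 2.0 * x + noise.gauss(0, 1)) for x in range(50)]
average_mse, fold_mses = k_fold_mse(data, k=5)
print(round(average_mse, 3), [round(s, 3) for s in fold_mses])
```

Comparing two candidate models would mean running `k_fold_mse` for each on the same folds and preferring the lower average.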
In the trading system context the results of this process can be used in a variety of quantitative and qualitative ways:
- To distinguish between two systems with similar architecture and similar performance.
- To estimate likely performance in the future.
- To identify insignificant parameters, i.e. those whose optimal values vary a lot from one optimized system to another.
- To check the robustness of the parameters: a “robust” system will have similar optimal parameter sets regardless of the partitioning.
- To check the robustness of the performance: a “robust” system will have similar performance on both the training and validation datasets.
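The last three points could be turned into simple numbers, for instance the spread of each optimal parameter across folds and the gap between training and validation performance. The sketch below is one possible way to do that, not an established procedure: the function names are hypothetical, the coefficient of variation is my own choice of dispersion measure (it breaks down when a parameter's mean is zero), and the inputs are invented placeholders rather than results from any back-test.

```python
from statistics import mean, stdev

def parameter_dispersion(fold_params):
    """Coefficient of variation (stdev / |mean|) of each parameter across folds."""
    result = {}
    for name in fold_params[0]:
        values = [p[name] for p in fold_params]
        result[name] = stdev(values) / abs(mean(values))
    return result

def performance_gaps(fold_scores):
    """Validation score minus training score for each (train, validation) pair."""
    return [validation - train for train, validation in fold_scores]

# Invented placeholder values, one entry per fold. Not real results.
fold_params = [
    {"lookback": 200, "stop_atr": 2.0},
    {"lookback": 210, "stop_atr": 4.5},
    {"lookback": 190, "stop_atr": 1.0},
    {"lookback": 205, "stop_atr": 3.5},
    {"lookback": 195, "stop_atr": 6.0},
]
fold_cagr = [(0.21, 0.17), (0.19, 0.18), (0.22, 0.12), (0.20, 0.19), (0.23, 0.15)]

dispersion = parameter_dispersion(fold_params)
gaps = performance_gaps(fold_cagr)
for name, cv in dispersion.items():
    print(name, round(cv, 3))  # a large value suggests an unstable parameter
print(round(mean(gaps), 3))    # a strongly negative value suggests curve-fitting
```

What counts as "large" or "strongly negative" is a judgment call that these numbers do not settle on their own.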
In the future I hope to publish some results of studies constructed using both methods of sub-sampling, and to explore how best to carry out the sub-sampling.
Edit: I have since published a post with a 5-Fold Cross-Validation example.