I download CSI end-of-day data (EODD) from the TradingBlox website, then convert it to ratio-adjusted contracts (RadContracts). All is well … or is it? What if there are “issues” in the EODD? I need a solid data validation protocol.
For example, in the latest batch of data, I found the following issues:
- Euro Index (CU_0_I0B) – there is a jump at 1998-01-04 from 0.6 to 1.2 in the unadjusted close (RawCl)
- US Dollar Index (DX_0_I0B) – RawCl is 1/10 of the correct value for half the series
- Nasdaq Composite (ND_0_I0B) – RawCl is a factor of 10 too large for the entire series
- Lean Hogs (LH_0_I0B) – 5 roll-overs show price changes exceeding 20%
- Natural Gas (NG20_I0B) – 5 roll-overs show price changes exceeding 20%
- US Bond (US_0_I0B) – Panama adjustment changes within the same contract month
This list is by no means exhaustive. Clearly, a data cleaning process is required before any analysis can be completed.
So far, issues seem to fall into five categories:
- Explainable and fixable issues such as that in the Euro series caused by splicing D-Mark to Euro series
- Fixable errors such as those found in the DX and ND data series – there is an easily identifiable pattern that can be reversed
- Errors that amount to noise such as those in the US Bond series – easy to find, of very small magnitude, but annoying to have.
- Genuinely unusual data points that are internally consistent such as LH and NG (OHLC all make sense, Panama adjustment is stable, etc)
- Errors I cannot find but are surely there – these are the ones I fear most. These are the ones that, hopefully, will be reduced by gradually improving my Validation Script.
Issues like these highlight the value of a second data source, which could confirm, for example, that the huge roll-overs in Lean Hogs and Nat Gas are genuine.
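As a rough illustration of the "easily identifiable pattern" in the DX-style errors, here is a minimal Python sketch that flags days where the unadjusted close jumps by roughly a factor of 10 in either direction. The function name `find_scale_breaks`, the tolerance, and the sample prices are all my own assumptions for illustration, not part of any existing script or of the real data.

```python
def find_scale_breaks(closes, factor=10.0, rel_tol=0.2):
    """Return (index, multiplier) pairs where the close jumps by ~factor or ~1/factor.

    closes  : list of unadjusted closes (RawCl) in date order
    factor  : suspected scale error (10 for the DX-style issue)
    rel_tol : how far the day-to-day ratio may deviate from the factor (assumed value)
    """
    breaks = []
    for i in range(1, len(closes)):
        prev, cur = closes[i - 1], closes[i]
        if prev <= 0 or cur <= 0:
            continue  # a ratio is meaningless for non-positive prices
        ratio = cur / prev
        for target in (factor, 1.0 / factor):
            if abs(ratio / target - 1.0) <= rel_tol:
                breaks.append((i, target))
    return breaks


# Made-up prices: the first three days are 1/10 of the correct level.
sample = [9.5, 9.6, 9.7, 97.5, 98.0, 98.2]
print(find_scale_breaks(sample))
```

A check like this only catches a scale change inside the series. A series that is off by a constant factor from start to finish, like ND, has no break to find, so it can only be caught by comparing against an independent reference level.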
So far, my checks are very limited:
- Check that the Panama adjustment only changes on the day of a contract roll.
- Check that, on a roll, the daily change does not exceed a threshold.
- Check that, outside of a roll, the daily change does not exceed a threshold.
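The three checks above can be sketched in a few lines of Python. This is only an outline under stated assumptions: the `Bar` record, the `validate` function, and the field names are hypothetical, the 20% roll threshold is borrowed from the Lean Hogs and Natural Gas findings, and the 10% non-roll threshold is a placeholder I picked for illustration.

```python
from dataclasses import dataclass


@dataclass
class Bar:
    date: str          # trading date, e.g. "1998-01-05"
    raw_close: float   # unadjusted close (RawCl)
    adjustment: float  # cumulative Panama adjustment in effect on this day
    is_roll: bool      # True if the contract rolls on this day


def validate(bars, roll_limit=0.20, daily_limit=0.10, tol=1e-9):
    """Return a list of (date, message) tuples, one per failed check."""
    issues = []
    for prev, cur in zip(bars, bars[1:]):
        # Check 1: the adjustment may only change on a roll day.
        if abs(cur.adjustment - prev.adjustment) > tol and not cur.is_roll:
            issues.append((cur.date, "adjustment changed outside a roll"))

        if prev.raw_close == 0:
            issues.append((cur.date, "previous close is zero, change undefined"))
            continue

        # Checks 2 and 3: daily change against a roll or non-roll threshold.
        change = abs(cur.raw_close / prev.raw_close - 1.0)
        limit = roll_limit if cur.is_roll else daily_limit
        if change > limit:
            kind = "roll" if cur.is_roll else "non-roll"
            issues.append((cur.date, f"{kind} change {change:.1%} exceeds {limit:.0%}"))
    return issues


# Made-up bars that trip each check once.
bars = [
    Bar("d1", 100.0, 0.0, False),
    Bar("d2", 101.0, 0.0, False),
    Bar("d3", 101.5, 0.5, False),  # adjustment moves without a roll
    Bar("d4", 130.0, 2.0, True),   # roll with a change above 20%
    Bar("d5", 150.0, 2.0, False),  # non-roll change above 10%
]
for date, message in validate(bars):
    print(date, message)
```

Returning the failures as a list, rather than stopping at the first one, makes it easy to review a whole series at once and to decide which flags are errors and which are genuine but unusual data points.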
I will do some searching to find additional data validation steps to improve the quality of the raw data I am using for analysis.
This post is part of a longer series:
- Garbage In => Garbage Out
- Creating Good Price Series
- This Post
- Generalized Extreme Value Distribution
- Some additional protocols.