I download CSI end-of-day data (EODD) from the TradingBlox website, then convert it to ratio-adjusted contracts (RadContracts). All is well … or is it? What if there are “issues” in the EODD? I need a solid data validation protocol.

For example, in the latest batch of data, I found the following issues:

  • Euro Index (CU_0_I0B) – there is a jump at 1998-01-04 from 0.6 to 1.2 in the unadjusted close (RawCl)
  • US Dollar Index (DX_0_I0B) – RawCl is 1/10 of the correct value for half the series
  • Nasdaq Composite (ND_0_I0B) – RawCl is a factor of 10 too large for the entire series
  • Lean Hogs (LH_0_I0B) – 5 roll-overs show price changes exceeding 20%
  • Natural Gas (NG20_I0B) – 5 roll-overs show price changes exceeding 20%
  • US Bond (US_0_I0B) – Panama adjustment changes within the same contract month

This list is by no means exhaustive. Obviously, a data cleaning process is required before any analysis can be completed.

[Figure: Flowchart establishing data validation protocol.]

So far, issues seem to fall into five categories:

  • Explainable and fixable issues, such as the jump in the Euro series caused by splicing the D-Mark series to the Euro series
  • Fixable errors, such as those found in the DX and ND data series – there is an easily identifiable pattern that can be reversed
  • Errors that amount to noise, such as those in the US Bond series – easy to find and of very small magnitude, but annoying to have
  • Genuinely unusual data points that are internally consistent, such as those in LH and NG (OHLC all make sense, Panama adjustment is stable, etc.)
  • Errors I cannot find but that are surely there – these are the ones I fear most, and the ones I hope to reduce by gradually improving my validation script
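The "easily identifiable pattern" in the second category can be illustrated with a short sketch. It assumes the scale error starts partway through the series, as in DX, so that the raw close jumps by roughly a factor of 10 from one day to the next. The function name `find_scale_breaks` and the 15% tolerance are illustrative assumptions, not part of any existing script. A series that is wrong by the same factor throughout, as in ND, has no internal break and would not be caught this way.

```python
from typing import List, Tuple


def find_scale_breaks(
    closes: List[float], factor: float = 10.0, tolerance: float = 0.15
) -> List[Tuple[int, float]]:
    """Return (index, multiplier) pairs where the close jumps by ~factor or ~1/factor.

    `factor` and `tolerance` are hypothetical defaults chosen for illustration.
    """
    breaks = []
    for i in range(1, len(closes)):
        prev, cur = closes[i - 1], closes[i]
        if prev <= 0 or cur <= 0:
            continue  # ratios are meaningless for non-positive prices
        ratio = cur / prev
        for multiplier in (factor, 1.0 / factor):
            # Flag the day if the ratio is within `tolerance` of the multiplier.
            if abs(ratio / multiplier - 1.0) <= tolerance:
                breaks.append((i, multiplier))
    return breaks


# Made-up closes: the last two values are 10x too large.
sample = [9.5, 9.6, 9.4, 95.0, 96.0]
print(find_scale_breaks(sample))  # [(3, 10.0)]
```

Once a break is located, the affected segment can be divided or multiplied by the factor to reverse the error, ideally after confirming the correct level against an independent reference.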

The last category shows why a second data source would be nice to have, for example to confirm that the huge roll-overs in Lean Hogs and Nat Gas are genuine.
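A cross-check against a second source could look something like the sketch below. It assumes both sources have been loaded into dictionaries mapping an ISO date string to a close price. The name `compare_sources` and the 1% tolerance are hypothetical choices for illustration.

```python
from typing import Dict, List, Tuple


def compare_sources(
    primary: Dict[str, float], secondary: Dict[str, float], tolerance: float = 0.01
) -> List[Tuple[str, float, float]]:
    """List dates present in both sources whose closes differ by more than `tolerance`."""
    mismatches = []
    # Only dates present in both sources can be compared.
    for date in sorted(primary.keys() & secondary.keys()):
        a, b = primary[date], secondary[date]
        if b == 0:
            continue  # skip to avoid division by zero; such rows need a manual look
        if abs(a / b - 1.0) > tolerance:
            mismatches.append((date, a, b))
    return mismatches


# Made-up values: the primary close on 2020-01-03 is 1/10 of the secondary close.
primary = {"2020-01-02": 100.0, "2020-01-03": 10.1, "2020-01-06": 102.0}
secondary = {"2020-01-02": 100.2, "2020-01-03": 101.0, "2020-01-07": 99.0}
print(compare_sources(primary, secondary))  # [('2020-01-03', 10.1, 101.0)]
```

A check like this cannot say which source is wrong, only where the two disagree, so every flagged date still needs to be inspected by hand.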

So far, my checks are very limited:

  1. Check that the Panama adjustment only changes on the day of a contract roll.
  2. Check that, on a roll, the daily change does not exceed a threshold.
  3. Check that, outside of a roll, the daily change does not exceed a threshold.
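The three checks above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the actual validation script: the `Bar` record, its field names, and the rule that a roll is any day on which the contract month differs from the previous day are all hypothetical. The 20% roll threshold mirrors the figure quoted for Lean Hogs and Natural Gas; the 10% non-roll threshold is an arbitrary placeholder.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Bar:
    date: str         # ISO date, e.g. "2020-01-02"
    contract: str     # contract month in use that day, e.g. "202003" (assumed field)
    raw_close: float  # unadjusted close (RawCl)
    panama: float     # cumulative Panama adjustment in effect that day (assumed field)


def validate(
    bars: List[Bar], roll_limit: float = 0.20, day_limit: float = 0.10
) -> List[Tuple[str, str]]:
    """Run the three checks on consecutive bars and return (date, message) pairs."""
    issues = []
    for prev, cur in zip(bars, bars[1:]):
        is_roll = cur.contract != prev.contract
        # Check 1: the adjustment may only change on a roll day.
        if cur.panama != prev.panama and not is_roll:
            issues.append((cur.date, "adjustment changed without a roll"))
        if prev.raw_close == 0:
            continue  # relative change is undefined
        change = abs(cur.raw_close / prev.raw_close - 1.0)
        # Check 2: change across a roll must stay within roll_limit.
        if is_roll and change > roll_limit:
            issues.append((cur.date, "roll change above threshold"))
        # Check 3: change on an ordinary day must stay within day_limit.
        if not is_roll and change > day_limit:
            issues.append((cur.date, "non-roll change above threshold"))
    return issues


# Made-up bars that trigger each check once.
bars = [
    Bar("2020-01-02", "202003", 100.0, 0.0),
    Bar("2020-01-03", "202003", 101.0, 0.0),
    Bar("2020-01-06", "202003", 102.0, 0.5),  # adjustment moves, no roll
    Bar("2020-01-07", "202006", 130.0, 1.0),  # roll with a ~27% jump
    Bar("2020-01-08", "202006", 150.0, 1.0),  # ordinary day with a ~15% jump
]
for date, message in validate(bars):
    print(date, message)
```

Keeping the roll and non-roll thresholds separate matters because a large gap on a roll can be genuine, while the same gap within a single contract is far more suspicious.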

I will do some searching to find additional data validation steps to improve the quality of the raw data I am using for analysis.

This post is part of a longer series:

  1. Garbage In => Garbage Out
  2. Creating Good Price Series
  3. This Post
  4. Generalized Extreme Value Distribution
  5. Some additional protocols.
