I download CSI end-of-day data (EODD) from the TradingBlox website, then convert it to ratio-adjusted contracts (RadContracts). All is well … or is it? What if there are “issues” in the EODD? I need a solid data validation protocol.

For example, in the latest batch of data, I found the following issues:

  • Euro Index (CU_0_I0B) – there is a jump at 1998-01-04 from 0.6 to 1.2 in the unadjusted close (RawCl)
  • US Dollar Index (DX_0_I0B) – RawCl is 1/10 of the correct value for half the series
  • Nasdaq Composite (ND_0_I0B) – RawCl is a factor of 10 too large for the entire series
  • Lean Hogs (LH_0_I0B) – 5 roll-overs show price changes exceeding 20%
  • Natural Gas (NG20_I0B) – 5 roll-overs show price changes exceeding 20%
  • US Bond (US_0_I0B) – Panama adjustment changes within the same contract month

This list is by no means exhaustive. Obviously a data cleaning process is required before any analysis can be completed:

[Figure: flowchart establishing the data validation protocol]

So far, issues seem to fall into five categories:

  • Explainable and fixable issues, such as the jump in the Euro series caused by splicing the D-Mark series onto the Euro series
  • Fixable errors, such as those found in the DX and ND data series, where there is an easily identifiable pattern that can be reversed
  • Errors that amount to noise, such as those in the US Bond series: easy to find and of very small magnitude, but annoying to have
  • Genuinely unusual data points that are internally consistent, such as LH and NG (OHLC all make sense, Panama adjustment is stable, etc.)
  • Errors I cannot find but that are surely there. These are the ones I fear most, and the ones that, hopefully, will be reduced by gradually improving my Validation Script.
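As an illustration of the second category, here is a minimal sketch of how a factor-of-10 break inside a series (like the one in DX) might be located. The function name `find_scale_breaks` and the `tolerance` parameter are hypothetical, not part of my Validation Script. Note that this can only find a break within a series; a series that is wrong by a factor of 10 throughout (like ND) needs an outside reference to detect.

```python
import math

def find_scale_breaks(closes, tolerance=0.1):
    """Return (index, direction) pairs where the close jumps by roughly 10x.

    direction is +1 when closes[i] is about 10x closes[i-1] and -1 when it
    is about 1/10. Assumes all closes are strictly positive. `tolerance` is
    measured in log10 units (an illustrative default, not a tuned value).
    """
    breaks = []
    for i in range(1, len(closes)):
        log_ratio = math.log10(closes[i] / closes[i - 1])
        # A clean factor-of-10 error gives a log10 ratio near +1 or -1.
        if abs(abs(log_ratio) - 1.0) < tolerance:
            breaks.append((i, 1 if log_ratio > 0 else -1))
    return breaks

# Made-up prices: the first two values are quoted at 1/10 of the rest.
sample = [9.5, 9.6, 96.5, 97.0, 96.8]
print(find_scale_breaks(sample))
```

Once a break is located, the affected segment could be multiplied or divided by 10, but which side of the break is the correct one still has to be decided by eye or against a second source.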

These issues highlight how useful a second data source would be, for example to check that the huge roll-overs in Lean Hogs and Nat Gas are genuine.

So far, my checks are very limited:

  1. Check that the Panama adjustment only changes on the day of a contract roll.
  2. Check that, on a roll, the daily change does not exceed a threshold.
  3. Check that, outside of a roll, the daily change does not exceed a threshold.

I will do some searching to find additional data validation steps to improve the quality of the raw data I am using for analysis.

This post is part of a longer series:

  1. Garbage In => Garbage Out
  2. Creating Good Price Series
  3. This Post
  4. Generalized Extreme Value Distribution
  5. Some additional protocols.
