Hedge Fund Data Hygiene - The Scrubbing Brush

Garbage In = Garbage Out

The results of data analysis are only as good as the data you use. There are a variety of hedge fund data vendors out there. You may have access via your Bloomberg terminal, or perhaps you have purchased access to data from a well-known hedge fund data vendor. Regardless of your data source, there will always be issues with the data:

  1. The data vendor is constrained by database architecture decisions that were made in the past. It is a strategic challenge for them to update their architecture because of their installed base.
  2. The sources of the data, the hedge funds themselves, will be “strategic” about the data they provide. Naturally, they do not want to be filtered out – they want to make a pitch.
  3. There will be errors in the data as well as records that are unusable for various reasons.

In this article I take a look at a variety of issues that can arise with purchased hedge fund data. I start at the macro level and get progressively more detailed. I provide R code snippets wherever appropriate to make the article more useful.

I have based this article on my experience with my current go-to data source: Eurekahedge. I don’t receive any promotional consideration from them, and I hope I don’t come across as critical of Eurekahedge: their product is amongst the best in the industry, and you will find similar issues with any data source you choose.

Managing the Data

Protect the Original Data

Always, always, always save the original data in its original purchased form. You must always be able to go back to the source. So create a directory where you store the data in the form it was provided to you. Make sure that this location is backed up by whatever data recovery scheme you employ.

You will likely want to take the data through a number of steps as part of your hedge fund data hygiene process. I try to do absolutely nothing to the data outside of my R data scrubbing scripts. In my case I receive the data in the form of a zipped Excel spreadsheet. I simply unzip it and save each tab as a comma-separated values (CSV) file.

Processing the Data

Don’t be tempted to do some of the data cleansing in Excel and some in your pre-processing script. Build a standalone data scrubbing tool. Your process should be clean like this:

Raw Data -> Data Scrubbing -> Clean Data -> Analysis

Also, don’t be tempted to filter out any of the records in the data cleaning stage. The filtering step should be at the beginning of your analysis process, not the end of your data preparation process.

Naming Conventions

Think about how you want to name each file. It’s a trivial thing but I recommend using a “YYYY-MM Tab Name.csv” format for file naming (By “Tab Name” I mean the name of the tab in the excel spreadsheet). This ensures the files sort in chronological order. It also ensures that in my script I just need to set a variable to the correct “YYYY-MM” and the script will open and operate on the correct files.

With the output, I tend NOT to save csv files. Rather I save the variables (vectors, matrices, data frames) in an RData file. In this way I avoid a proliferation of little csv files; I have a single RData file. I use the same naming convention for the output file: “YYYY-MM The_Clean_Data.RData”. Once again, this ensures all the files will sort together making them easier to find at a later date.
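As a sketch, the save/load round trip under this convention looks like the following (the variable names are just placeholders, not anything from the Eurekahedge feed):

```r
# Build the output file name from the month being processed
fileMonth <- "2018-04"
outFile <- paste(fileMonth, "The_Clean_Data.RData")  # "2018-04 The_Clean_Data.RData"

# Any cleaned variables - vectors, matrices, data frames - go in one file
retData <- matrix(0, nrow=2, ncol=3)
aumData <- matrix(0, nrow=2, ncol=3)
save(retData, aumData, file=outFile)

# At analysis time a single load() restores them all by name
load(outFile)
```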

Preserving Prior Results

In my data cleansing scripts, before processing a file, I check to see if an output file for the process already exists. If it does, I do not over-write it, I simply rename it by pre-pending a date and time stamp. This way, if something goes wrong I don’t lose the work I did previously, and I can tell what order the files were generated. It is better to have too many copies of data than not enough.

Here’s the basic script:

if (file.exists(paste(fileMonth, "CleanData.RData"))) {
  file.copy(from = paste(fileMonth, "CleanData.RData"),
            to   = paste(format(Sys.time(), "%Y-%m-%d %H%Mhrs %Ssecs"),
                         "CleanData.RData"))
}

Let’s say we run the script at 17:15:26 on April 29, 2018. The script checks to see if a copy of “2018-04 CleanData.RData” exists. If so, the script makes a copy of it with the name “2018-04-29 1715hrs 26secs CleanData.RData”. So long as no attempt is made to run the script twice within the same second, any previous copy of the file will be saved. All such files sort in chronological order.

Reading the Data

When reading data into R, I always run into problems. Here is a typical list, along with how I fix each one:

  • Odd lines at the top of the csv file – use the “skip” argument.
  • Blank lines after the header row – just delete them using df[-line_numbers, ].
  • Quotes and apostrophes – make sure the “quote” argument is correct.
  • Not available or Null – make sure the “na.strings” argument is correct.
  • Non-standard character sets – set the “fileEncoding” argument, or use the utf8 package to detect and handle odd characters. This is a nightmare problem when it arises!
  • Missing end-of-line on the last line of the csv file – open the file in a text editor and add a return to the last line.

Following are the settings I use to read in the data files from Eurekahedge:

# Fund AUM and Return (NAV)
fundNAV <- read.table("2018-01_NAV_AUM_Data.csv", 
                      header=F, 
                      sep=",",
                      quote="\"",
                      row.names=NULL,
                      na.strings = c("NA", "n/a"), 
                      strip.white=T, 
                      stringsAsFactors=F)
columnNames <- format(as.Date(paste("28", fundNAV[1, 3:ncol(fundNAV)]), "%d %b %y"), "%Y-%m-%d")

Note that I read the data in with header set to FALSE so I can easily grab the first row and use it for my column headers BEFORE R does that really annoying thing of adding “X”s and “.”s all over the place. Also note that for the fundNAV data I set the column names to the 28th of each month. The names are not applied to the columns right away; they are set up for use a little later. More on that below.
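The header trick, sketched on a toy data frame (the layout here is illustrative, not the actual Eurekahedge format):

```r
# Stand-in for a file read with header=FALSE: row 1 holds the real headers
fundNAV <- data.frame(V1=c("Fund Name", "Alpha Fund"),
                      V2=c("Jan 18",    "1.2"),
                      V3=c("Feb 18",    "-0.5"),
                      stringsAsFactors=FALSE)

columnNames <- as.character(fundNAV[1, ])  # grab row 1 verbatim - no "X"s, no "."s
fundNAV <- fundNAV[-1, ]                   # drop the header row from the data
colnames(fundNAV) <- columnNames           # apply the untouched names
```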

Planning Ahead

Think ahead! If you are going to be doing the exact same thing to every row or column in your analysis, save the data in matrices not data frames. For example, don’t have a data frame with the fund name in column 1, and the monthly return data in the remaining columns. You will drive yourself batshit ignoring the first column every time you want to manipulate the return data. Name the rows using the fund name so the data is retained in the matrix. Then you can use simple operations and the “apply” function to manipulate the data in the matrix.
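A minimal illustration of the matrix-plus-row-names approach (toy numbers, hypothetical fund names):

```r
# Monthly returns as a matrix; fund names live in the row names,
# so every column is numeric and apply() works cleanly
retData <- matrix(c(0.01, 0.02, -0.01,
                    0.00, 0.03,  0.01),
                  nrow=2, byrow=TRUE,
                  dimnames=list(c("Alpha Fund", "Beta Fund"),
                                c("2018-01-28", "2018-02-28", "2018-03-28")))

meanReturns <- apply(retData, 1, mean)  # one statistic per fund, no column gymnastics
```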

Taking the idea a step further, see if you can separate out the data into different variables to make your computations more efficient. For example, my data has alternating rows of fund returns and AUM – I split this into two variables retData and aumData.
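One way to do the split, assuming (as in my data) that returns and AUM strictly alternate row by row:

```r
# Toy combined matrix: odd rows are returns, even rows are AUM
combined <- matrix(1:12, nrow=4, byrow=TRUE,
                   dimnames=list(c("Alpha Fund ret", "Alpha Fund aum",
                                   "Beta Fund ret",  "Beta Fund aum"),
                                 NULL))

oddRows <- seq(1, nrow(combined), by=2)
retData <- combined[oddRows,     , drop=FALSE]  # rows 1, 3, ...
aumData <- combined[oddRows + 1, , drop=FALSE]  # rows 2, 4, ...
```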


Sort and name the columns in a meaningful way. Some databases, including Eurekahedge, present the data with the most recent in the first column. You will be better off reversing the order so the oldest is on the left and the most recent is on the right. Think about calculating an equity curve using “cumprod” and then plotting it: the data needs to run from left to right.

Name the columns with the month end. Use the YYYY-MM-DD convention. I use the 28th of the month because every month has at least 28 days, so the dates always parse. It just makes life easier. You can reformat the names later for charts and tables. For example, if you just want to display YYYY-MM, use format(date_data, “%Y-%m”) (note the lower-case “m” – upper-case “%M” means minutes).
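Putting the two ideas together – reverse the columns, then reformat the month-end names for display only (toy numbers, hypothetical fund name):

```r
# Columns arrive most-recent first; flip them so time runs left to right
retData <- matrix(c(0.03, 0.02, 0.01), nrow=1,
                  dimnames=list("Alpha Fund",
                                c("2018-03-28", "2018-02-28", "2018-01-28")))
retData <- retData[, ncol(retData):1, drop=FALSE]

# cumprod now runs naturally from oldest to newest
equityCurve <- c(1, cumprod(1 + retData[1, ]))

# Reformat the names for a chart without touching the stored data
displayNames <- format(as.Date(colnames(retData)), "%Y-%m")  # "2018-01" "2018-02" "2018-03"
```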

Data Conversions

Logicals

Yes / No type data should be converted to T/F. This will allow you to use short-hand. Instead of selecting rows from a data frame by testing:

df <- df[df[ , 1] == "y", ]

you will instead be able to use:

df <- df[df[ , 1], ]

to get all the rows where column 1 is set to true. It doesn’t seem like a big deal, but it is one less expression to be evaluated for every single row (about 23,000 in the case of Eurekahedge).

Decide what to do about unspecified or NA values in T/F columns. In some cases it makes sense to convert NA to TRUE, and in others it should be FALSE. Sometimes, NA is the only value that makes sense.

Remember to convert all the variations of “Y/N”, “y/n”, “Yes/No”, “YES/NO”, “yes/no”, that may appear in your data. It is smart to inspect a column of data using “unique” to make sure you have caught every variation:

sort(unique(df[ , "yes_no_column"]))

What if there is a typo: “ye”? I know what you are thinking: “How can there be a typo when the data is entered using a drop-down?” Stuff happens.
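A sketch of the conversion – anything I cannot positively identify as yes or no stays NA for manual inspection (the helper function is my own, not from any package):

```r
# Map every yes/no variation to a logical; unknowns stay NA
toLogical <- function(x) {
  x <- tolower(trimws(x))
  out <- rep(NA, length(x))
  out[x %in% c("y", "yes")] <- TRUE
  out[x %in% c("n", "no")]  <- FALSE
  out
}

flags <- toLogical(c("Y", "yes", "NO", " n ", "", "ye"))
# "ye" and "" remain NA - surface them with sort(unique()) and decide case by case
```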

Numbers Not Text

Some databases will include numbers that are stored as text, or at least they are by the time they get to you. For example, “$10,000,000” instead of the integer 10000000. There are two problems with this:

  • Your analytical package – in my case R – is unlikely to be able to operate on these values as-is.
  • At data entry, folks get tempted to add additional information. For example: “$10,000,000 (or equivalent in Euros)”.

Here’s the code I use to fix these problems:

firstSplit <- function(string, splitter) {
  # Support function for "FixNumeric".
  # Splits a given string at the first occurrence of "splitter"
  # and keeps only the leading piece.
  strsplit(string, splitter, fixed=TRUE)[[1]][1]
}

FixNumeric <- function(dataFrame, colName) {
  newColumn <- dataFrame[ , colName]                 # grab the data to fix
  colnames(dataFrame)[which(colnames(dataFrame) == colName)] <-
    paste(colName, "Old")                            # rename the original column
  newColumn[newColumn == ""] <- NA                   # set empty values to NA
  qualifierRows <- grep(" (", newColumn, fixed=TRUE) # find rows with extra info
  newColumn[qualifierRows] <-
    sapply(newColumn[qualifierRows], firstSplit,
           splitter=" (")                            # split off the extra info
  noneRows <- grep("None", newColumn, fixed=TRUE)    # find rows with value "None"
  newColumn[noneRows] <- "0"                         # set them to zero
  newColumn <- gsub("$", "", newColumn, fixed=TRUE)  # get rid of dollar symbols
  newColumn <- gsub(",", "", newColumn, fixed=TRUE)  # get rid of commas
  newColumn <- as.numeric(newColumn)                 # convert to numeric
  dataFrame <- cbind(dataFrame, newColumn)           # add the column at the end
  colnames(dataFrame)[ncol(dataFrame)] <- colName    # restore the column name
  return(dataFrame)
}

Remember: your mileage may vary. A quick test to see if you have successfully converted every element in the column is simply to try some arithmetic and see if it fails. Also use grep on the column to inspect all the text after the split, to check you are not throwing away something vital (sort(unique()) is your friend).
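One way to make that arithmetic test fail loudly is a small checker (my own helper, not part of any package) that names the offending rows:

```r
# Stop with the row numbers of any element that failed to convert to numeric
checkNumeric <- function(x, colName) {
  bad <- which(is.na(x))
  if (length(bad) > 0)
    stop("Column '", colName, "' failed to convert on rows: ",
         paste(bad, collapse=", "))
  invisible(TRUE)
}

checkNumeric(c(10000000, 250000), "Fund Size ($m)")  # passes silently
```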

Units

Think about what units you will want to use for calculations. For example, do not store 1% as “1”, use “0.01”. This saves you having to mess around scaling values all the time. Building an equity curve becomes:

equityCurve <- c(1, cumprod(1 + returns))

instead of

equityCurve <- c(1, cumprod(1 + returns / 100))

Over a twenty-year monthly time frame, that saves 240 division operations. While that doesn’t sound like much, if you are simulating, say, 100,000 portfolios, this simple step saves 24,000,000 calculations!

Similarly, if the value you are dealing with is millions, use millions. Don’t use 10 and try to remember at some later point that “10” here means 10 million, “10” there means 10 thousand, and over there, “10” actually means “10”.

Use Look-Up Tables

For some problems you will need to create look-up tables which you can use to translate from the data in the table to the data you need to work with.

For example, I need to be able to filter using liquidity, redemption frequency, and lock-up data. There are many different ways funds provide this information and the data entry folks may not know how to standardize the data. As a consequence, I might see redemption frequency specified as follows: “30 days”, “Monthly”, “31 days”, “15th day of the month”, “5th business day of the month”, etc. For my purposes, these are all essentially the same redemption frequency. The shortest interval between two redemptions is a month.

The way to deal with this is to decide how you want the data to be presented so you can work with it. Then build a translation or cross-reference table from the raw data to your data. You should save these tables as csv files so you can manually edit them as needed.

Here are the last 20 lines (out of 739) of my notice period comma-separated values look-up table:

Data Cleansing Notice Period Look-Up Table

Your code simply runs through the raw data, and looks up in the translation table the way you want that data represented. It should add your representation of the data as an extra column in the data frame. As always, DO NOT over-write the original data.

Since you will be doing this on a monthly basis, you will need to ensure your script can handle a new term it hasn’t seen before. Have it throw an error when it can’t find a term in the look-up table, telling you exactly where and why it failed. Then you can edit the look-up table manually to add the new term.
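A sketch of that translate-or-fail step, using a hypothetical two-column redemption-frequency table (my own terms, not Eurekahedge’s):

```r
# Hypothetical look-up table: raw vendor terms -> my standardized terms
lookup <- data.frame(rawTerm = c("Monthly", "30 days", "31 days"),
                     myTerm  = c("1m", "1m", "1m"),
                     stringsAsFactors=FALSE)

# Translate raw terms; stop loudly on anything not yet in the table
translate <- function(raw, lookup) {
  idx <- match(raw, lookup$rawTerm)
  if (any(is.na(idx)))
    stop("Term(s) not in look-up table: ",
         paste(unique(raw[is.na(idx)]), collapse=", "))
  lookup$myTerm[idx]
}

translate(c("Monthly", "31 days"), lookup)  # "1m" "1m"
```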

Every so often you should manually go through the csv files to ensure that the “translations” you have used are still effective.

Test, Check, Fix

Don’t ever assume your script is working. Test, check, and fix every command.

Consider having your script report back to you what it is doing. For example, have your script tell you it replaced 2,324 instances of “None” with 0 in data frame “fundDetails” column “Fund Size ($m)”. Then you will have a chance to apply a reality check.

Have your script finish up by selecting, say, 50 records at random for you to manually review. Since you have not deleted or over-written any data, you have the opportunity to compare the original with the new.
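The random spot-check can be as simple as the following (set.seed is optional, but makes the same sample reproducible; the data frame is a stand-in):

```r
# Draw up to 50 rows at random for manual review
cleanData <- data.frame(fund = paste("Fund", 1:200), stringsAsFactors=FALSE)
set.seed(42)                     # reproducible spot-check
reviewRows <- sample(nrow(cleanData), min(50, nrow(cleanData)))
reviewSet  <- cleanData[reviewRows, , drop=FALSE]
```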

Conclusion

Cleaning up data is a chore that cannot be avoided. If you don’t do it, you will waste good data and draw false conclusions from bad data. Your hedge fund data hygiene process is critical to the hedge fund screening and analysis process.

I hope you can make use of some of the ideas above either directly or indirectly. Feel free to comment below with any suggestions or criticisms.
