Trying to figure out an easy way to do Pearson’s Chi-Squared Test in R when verifying goodness of fit. I thought the function chisq.test() would help!

Problems?

• There’s no way to tell chisq.test the correct degrees of freedom (“df”) – it cannot figure this out itself. My stats text tells me to reduce df by 1 for each value derived from the observations. If I have 10 buckets of counts AND I have calculated the mean and sd from the data, df = 7 NOT 9 as assumed by chisq.test().

• In the test distribution, again according to my text, no expected count should be less than 5. All the chisq.test does is tell you the results are unreliable – it doesn’t fix them leaving the pre-processing up to you.

So here’s how to use R’s chisq.test()

First, I use a little function to set up some likely buckets for the observations “obs”:

Ian.Chisq.Buckets <- function(obs){
obs.hist <- hist(obs,breaks=seq(min(obs), max(obs), length=(1 + length(obs) / 5))) return(obs.hist)
}

This returns an object of class “histogram” which can be passed to the next step once the distribution to be tested for goodness of fit is set up. For example: test.cdf <- pnorm(obs.hist\$breaks, mean(obs), sd(obs)) generates a cdf that corresponds to the buckets in the histogram of the observations.

Passing the histogram of the observations and the cdf to the next function accumulates the buckets to ensure a minimum count of 5 in every bucket for the expected distribution, and accumulates the buckets in parallel for the observed distribution. Then the chisq.test() is run, setting rescale.p=T, because counts not probabilities are being passed.

Ian.Chisq.Test <- function(obs.hist, cdf){
if (!(class(obs.hist) == “histogram” | class(obs.hist) == “numeric”)) stop(“problem with obs.hist”)
if (length(obs.counts) != length(cdf) – 1) stop(“observations and cum dist function mismatch”)
if (any(obs.counts < 0) | any(cdf < 0) | any(cdf > 1)) stop(“Counts or cdf messed up”)
if (class(obs.hist) == “histogram”){
obs.counts <- obs.hist\$counts
}else{
obs.counts <- obs.hist
}
cdf <- sort(cdf)
cdf[length(cdf)] <- 1
exp.counts <- length(obs) * (cdf[-1] - cdf[-length(cdf)])
mid.point <- round(length(exp.counts)/2, 0)
for (i in 1:mid.point){
if (exp.counts[i] < 5){
exp.counts[i+1] <- exp.counts[i+1] + exp.counts[i]
exp.counts[i] <- 0
obs.counts[i+1] <- obs.counts[i+1] + obs.counts[i]
obs.counts[i] <- 0
}
}
for (i in length(exp.counts):mid.point){
if (exp.counts[i] < 5){
exp.counts[i-1] <- exp.counts[i-1] + exp.counts[i]
exp.counts[i] <- 0
obs.counts[i-1] <- obs.counts[i-1] + obs.counts[i]
obs.counts[i] <- 0
}
}
obs.counts <- obs.counts[exp.counts != 0]
exp.counts <- exp.counts[exp.counts != 0]
if (length(exp.counts) < 2){
stop(“insufficient expected observations”)
}else{
mess.str <- "If 'df' is incorrect, use Ian.Chisq.P(chisq.results, df) to correctly calculate P-Value"
return(c(chisq.test(obs.counts, p=exp.counts, rescale.p=T), “Message”=mess.str))
}
}

The chisq.test() results are returned with a reminder to think about df.

If df is actually different, then pass the whole set of test results plus the correct df to the last function:

Ian.Chisq.P <- function(chisq.results, df){
p.stat <- pchisq(chisq.results\$statistic, df, lower.tail=F)
names(p.stat) <- "P-Statistic"
return(p.stat)
}

The correct P-Statistic is returned!

## New Commodity Pool Launches

Please provide your name and email address so we can send you our quarterly compilation of new commodity pools registered with NFA.

## Biggest Hedge Funds By \$AUM

Please provide your name and email address so we can send you our quarterly compilation of biggest hedge funds by \$AUM as reported on the SEC's Form ADV.

## Free Hedge Fund ODD Toolkit

Note this is US Letter size. If you want A4, use the other button!

## Free Hedge Fund ODD Toolkit

Note this is A4 size. If you want US Letter, use the other button!

## Subscribe:

Don't miss our next hedge fund article!