Trying to figure out an easy way to do Pearson’s Chi-Squared Test in R when verifying goodness of fit. I thought the function chisq.test() would help!

Problems?

  • There’s no way to tell chisq.test the correct degrees of freedom (“df”) – it cannot figure this out itself. My stats text tells me to reduce df by 1 for each value derived from the observations. If I have 10 buckets of counts AND I have calculated the mean and sd from the data, df = 7 NOT 9 as assumed by chisq.test().


  • In the test distribution, again according to my text, no expected count should be less than 5. All the chisq.test does is tell you the results are unreliable – it doesn’t fix them leaving the pre-processing up to you.

So here’s how to use R’s chisq.test()

First, I use a little function to set up some likely buckets for the observations “obs”:

Ian.Chisq.Buckets <- function(obs){
obs.hist <- hist(obs,breaks=seq(min(obs), max(obs), length=(1 + length(obs) / 5))) return(obs.hist)
}

This returns an object of class “histogram” which can be passed to the next step once the distribution to be tested for goodness of fit is set up. For example: test.cdf <- pnorm(obs.hist$breaks, mean(obs), sd(obs)) generates a cdf that corresponds to the buckets in the histogram of the observations.

Passing the histogram of the observations and the cdf to the next function accumulates the buckets to ensure a minimum count of 5 in every bucket for the expected distribution, and accumulates the buckets in parallel for the observed distribution. Then the chisq.test() is run, setting rescale.p=T, because counts not probabilities are being passed.

Ian.Chisq.Test <- function(obs.hist, cdf){
if (!(class(obs.hist) == “histogram” | class(obs.hist) == “numeric”)) stop(“problem with obs.hist”)
if (length(obs.counts) != length(cdf) – 1) stop(“observations and cum dist function mismatch”)
if (any(obs.counts < 0) | any(cdf < 0) | any(cdf > 1)) stop(“Counts or cdf messed up”)
if (class(obs.hist) == “histogram”){
obs.counts <- obs.hist$counts
}else{
obs.counts <- obs.hist
}
cdf <- sort(cdf)
cdf[length(cdf)] <- 1
exp.counts <- length(obs) * (cdf[-1] - cdf[-length(cdf)])
mid.point <- round(length(exp.counts)/2, 0)
for (i in 1:mid.point){
if (exp.counts[i] < 5){
exp.counts[i+1] <- exp.counts[i+1] + exp.counts[i]
exp.counts[i] <- 0
obs.counts[i+1] <- obs.counts[i+1] + obs.counts[i]
obs.counts[i] <- 0
}
}
for (i in length(exp.counts):mid.point){
if (exp.counts[i] < 5){
exp.counts[i-1] <- exp.counts[i-1] + exp.counts[i]
exp.counts[i] <- 0
obs.counts[i-1] <- obs.counts[i-1] + obs.counts[i]
obs.counts[i] <- 0
}
}
obs.counts <- obs.counts[exp.counts != 0]
exp.counts <- exp.counts[exp.counts != 0]
if (length(exp.counts) < 2){
stop(“insufficient expected observations”)
}else{
mess.str <- "If 'df' is incorrect, use Ian.Chisq.P(chisq.results, df) to correctly calculate P-Value"
return(c(chisq.test(obs.counts, p=exp.counts, rescale.p=T), “Message”=mess.str))
}
}

The chisq.test() results are returned with a reminder to think about df.

If df is actually different, then pass the whole set of test results plus the correct df to the last function:

Ian.Chisq.P <- function(chisq.results, df){
p.stat <- pchisq(chisq.results$statistic, df, lower.tail=F)
names(p.stat) <- "P-Statistic"
return(p.stat)
}

The correct P-Statistic is returned!

New Commodity Pool Launches

Please provide your name and email address so we can send you our monthly compilation of new commodity pools registered with NFA.

You can unsubscribe at any time.

Thank you! Your file will be available after you have completed a two-step confirmation process. Check your in-box for step 1.

Biggest Hedge Funds By $AUM

Please provide your name and email address so we can send you our monthly compilation of biggest hedge funds by $AUM as reported on the SEC's Form ADV.

You can unsubscribe at any time.

Thank you! Your file will be available after you have completed a two-step confirmation process. Check your in-box for step 1.

Share This

Share

If you found this post informative, please share it!