The server called by the java script that implemented my LaTex equations has gone bye-bye. So I am trying to fix past posts so equations render in a legible way – apologies. In the mean time …

I have been getting better acquainted with Monte Carlo simulations. They play an important role in implementing Vince’s LSPM. Since I plan on writing my own version of LSPM, I better get a good understanding of the approach.

A good introduction is to use the “classic” example of estimating pi using the area of a circle. So, imagine a circle of radius, r = 1, centered on the origin. The area of the circle in the first quadrant (x, y >=0) is pi/4. Imagine a square of side = 1, bottom left corner at the origin. The area of the square is 1. If we randomly sprinkle N points over the square, N.pi/4 of them, on average, will end up inside the circle and N.(1 – pi/4) will end up outside. N = 1,000 looks like this:

Monte Carlo techniques are often discussed in terms of integrations. In this case we are integrating the function:

only we are using an approximation:

How accurate will this estimate of pi/4 be? Well, dropping a point at random into the x, y plane is a Bernoulli trial with a probability of p = pi/4 of being inside the circle. We will estimate p using N trials, so the standard deviation of the distribution of the means will be inline sigma = sqrt{frac{p.left ( 1 – p right )}{N}} allowing us to derive a confidence interval along with our estimate of pi/4.

On average, p = pi/4. So, for an estimate built from 10,000 points, we would expect a sd in the estimate of pi to be 4 x sqrt((pi/4)*(1 – pi/4)/10000) = 0.01642. To see if this is how the process comes out, we should try it out many, many times take the average and sd of the estimates of pi.

Below is an R script that generates a random set of trial x, y coordinates (uniformly distributed between zero and one) of size “trials”. Then it tests each pair of coordinates to see if they are inside or outside the circle. It totals all those inside the circle by the total number of trials to estimate pi/4. It takes 4 lines of code. Finally, the estimate is multiplied by 4 to get an estimate for pi itself.

The script repeats the process “runs” times so we can get a distribution of estimates for pi. Most of the script is to generate the three plots in this blog post: an example of one test (above), a distribution of estimates of pi and a qq-plot of the distribution of estimates to show that it is normally distributed (at the end of the post).

# Ian Rayner July 2011

#

# Estimating pi using MC simulation – uniform dist’n

#

# equation of circle of radius r centered on origin is

# x^2 + y^2 = r^2

# So x ~ U(0,1) and y ~ U(0,1), if x^2 + y^2 > r^2 then

# point is outside circle.

#

# ratio of points inside circle to total points = pi / 4

#~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

trials <- 1000

radius <- 1

runs <- 10000

plot.on <- T

pi.est <- vector(mode=”numeric”, length=runs)

for (i in 1:runs){

x.loc <- runif(trials, 0, radius)

y.loc <- runif(trials, 0, radius)

test <- (x.loc^2 + y.loc^2) <= radius^2

pi.est[i] <- sum(test) / trials

}

pi.est <- 4 * pi.est

w <- max(abs(pi.est – pi))

breaks=pi + seq(-w, w, l=17)

norm.counts <- runs * pnorm(breaks, mean(pi.est), sd(pi.est))

norm.counts <- norm.counts[-1] – norm.counts[-length(norm.counts)]

if (plot.on){

plot(x.loc, y.loc, main=”Example Samples”, pch=”.”)

lines(x=seq(0, radius, l=1001),

y=sqrt(radius^2 – seq(0, radius, l=1001)^2),

col=”red”, lwd=2)

histo <- hist(pi.est, breaks=breaks, plot=F)

dev.new()

plot(histo, ylim=c(0, max(histo$counts, norm.counts)),

main=”Histogram of estimates of pi”,

sub=paste(“RED = pi, BLUE = mean =”, round(mean(pi.est), 5),

” sd = “, round(sd(pi.est), 5),

” trials = “, trials, sep=””),

xlab=”Estimate of Pi”)

abline(v=pi, col=”red”, lwd=4)

abline(v=mean(pi.est), col=”blue”, lwd=2)

points(histo$mids, norm.counts,

col=”blue”)

dev.new()

qqplot(norm.counts, histo$counts,

main=”Q-Q Plot Counts for Est Pi vs Normal Dist’n”)

abline(0, 1)

}

Below are the results of running the script. Each MC simulation involves 10,000 trials. The script runs the simulation 1,000 times. The left chart below shows a histogram of the results of each run. pi is marked on the x-axis in red. The blue line and blue dots represent the normal approximation to the expected distribution using the mean and the sd of all 1,000 runs. On the right is a Q-Q plot comparing the actual distribution to the normal distribution using the mean and sd.

The mean of the runs is 3.14195 vs pi = 3.14159; an error of 0.01%. The sd of the distribution was 0.01632 vs 0.01642 calculated above; an error of 0.6%. So the MC simulation performs as expected. But can we do better? It is immediately apparent that we are testing a far larger area than we need to. The entire area of interest (where the curve runs) is in the top right half of the square. Maybe we should try a different distribution for the population of test points than a uniform distribution on the unit square. The next post will take this step.

When running MC simulations (for LSPM and other applications) the question always arises, "How many iterations should I let it run in order to get such-and-such accuracy?" Too many and you waste time (quadratically!), too few and your answer is less accurate than you wanted.

One way to go is to measure some data, then fit a function that calculates this output: "With 90% confidence, the error after N iterations is less than or equal to f(N, LR) where LR is the Lopsidedness Ratio." You could make 100 runs and measure error vs N for each run. Then find the 90% point of the error distribution and fit a curve to it. Repeat for various LRs. Done.

Thanks for the feedback Sluggo.

I initially planned 3 MC posts: this one, one using a more efficient distribution and one on the topic of stratification (all built around the "pi") example.

I will take your suggestion on board and add an extra (2?) post on sample size vs convergence and the issue of the proportion of the sample space inside the curve. This has relevance when it comes to stratification of the sample space as you generally bottom out when the curve just clips the corner of the sub-divided space – all the samples fall on one side or the other, and you can no longer use the normal approximation to the binomial to estimate error.

I hope I will do your suggestions justice!

I think you will get some additional insight from plotting (error) vs (# MC samples) across a set of 20 or 50 runs, as in this figure http://goo.gl/OoXGo . Note that the X axis is logarithmic.

I think you will get even more insight from monitoring the speed of convergence of an MC simulation, where the number of dots INSIDE the figure, is waaay more (or waaay less) than the number of dots OUTSIDE the figure. Your test example here can do this by setting the radius of your red circle to something smaller than 1; let me suggest r=0.20 .

You can have some fun fitting an equation to your (error) vs (# MC samples) data. MC theory says convergence is inversely proportional to sqrt(#samples) but — what is the coefficient? And how do you incorporate the Lopsidedness Ratio into your convergence equation? LR = MIN( (inside/outside) , (outside/inside) )