Wednesday, October 13, 2010

Why is the sum of two uniform randoms not uniform?

Lance Norkog asks on the Mahout mailing list why adding two uniformly distributed random variables gives a value with a pyramidal (triangular) distribution. I would normally answer on the mailing list, but here I can use lovely math notation. As I mentioned on-list, this is a very basic result that is closely related to the central limit theorem.

If we were to draw a picture of the joint density of two independent uniform variables \(x\) and \(y\), we would get something that is 1 inside the \([0,1] \times [0,1]\) square and 0 outside that region.

For a given value \(z\) of the sum \(x + y\), the points in the square where \(x + y = z\) form a diagonal line segment. Where \(z \le 0\) or \(z \ge 2\) that segment vanishes, and for \(0 \lt z \lt 2\) its length varies. The probability density of the sum at \(z\) is proportional to the length of that segment. As you can imagine, the length grows linearly up to a maximum at \(z = 1\) and then shrinks linearly back to zero.
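
Just to see that triangle appear, a quick simulation along these lines (the sample size is only an illustrative choice) overlays the empirical histogram of the sum of two uniforms with the triangular density described above:

   # empirical distribution of the sum of two uniform deviates
   z <- runif(100000) + runif(100000)
   hist(z, breaks=50, prob=TRUE)
   # the triangular density: z on [0,1], 2 - z on [1,2], drawn in red
   x <- seq(0, 2, by=0.01)
   lines(x, ifelse(x <= 1, x, 2 - x), col="red")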

For the sum of three random variables, the situation is more complex to reason about geometrically because we need to worry about the intersection of a plane and a cube.  For more variables, the geometry is not worth the trouble.

If we tackle the problem a bit more rigorously, the easiest approach is to compute the cumulative distribution of the sum. That leads to a convolution integral over the density functions involved. Since the densities are all 1, the integration limits are the key to the value, and those limits have to be broken down into cases. Actually doing these integrals is a pretty rare activity since the normal limit is approximated so well after summing just a few values.
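
Spelling out the two-variable case makes that limit bookkeeping concrete. For independent uniforms \(x\) and \(y\), the density of the sum \(z\) is

\[
f(z) = \int_{-\infty}^{\infty} f_X(x)\, f_Y(z - x)\, dx
     = \int_{\max(0,\,z-1)}^{\min(1,\,z)} 1\, dx
     = \begin{cases} z & 0 \le z \le 1 \\ 2 - z & 1 \le z \le 2 \\ 0 & \text{otherwise,} \end{cases}
\]

which is exactly the triangle seen geometrically above.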

Just how quickly that convergence happens can be seen by looking at the empirical distribution of the sum of three uniform deviates.  I used something very like this R code to produce the graph below:


   # empirical distribution of the sum of three uniform deviates
   hist(runif(100000) + runif(100000) + runif(100000),
        breaks=50, prob=TRUE)
   # overlay a normal with the same mean (3/2) and sd (sqrt(3/12) = 1/2) in red
   x <- seq(-1, 4, by=0.01)
   lines(x, dnorm(x, 1.5, sqrt(1/4)), col="red")


In this graph, the red curve is the normal distribution with the same mean and standard deviation.  As you can see, the normal's peak is a tiny bit too high and its tails are just a skosh too long for it to be a good description of the samples of the sum.
This is, however, just the sum of three random values.  If you sum more values, the convergence to the normal distribution is very strong and by the time you are adding six uniform random values together, the difference between the distributions is no longer visible in a graph like this and can only be detected numerically using lots of data and clever things like a Kolmogorov-Smirnov test.
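
If you want to check that numerically, something along these lines does the job (the six-fold sum and the sample size are just illustrative choices):

   # compare the sum of six uniforms to a normal with the same mean and sd
   s <- rowSums(matrix(runif(6 * 100000), ncol=6))
   ks.test(s, "pnorm", mean=3, sd=sqrt(6/12))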

The moral here is that there isn't much way to avoid this convergence to the normal distribution, and distorting the data to avoid it is probably pointless.

But if you are like me, being just a little more normal always made it easier to get along.

3 comments:

steven said...

Nice formal explanation of why a 6 is more likely than a 2 in the sum of two fair die rolls. Worth noting that not all distributions converge towards the Normal distribution when summed, e.g. the Cauchy distribution (it's actually the only counterexample I know of).

Ted Dunning ... apparently Bayesian said...

You are exactly correct that some distributions do not exhibit the convergence. The key criterion is that every distribution with finite variance will exhibit the convergence and those without will not. The mean of the Cauchy is also undefined.

The Cauchy distribution is a special case of the t distribution with one degree of freedom and is a useful archetype for thinking about problems that lead to "Black Swan" sorts of events. A very interesting exercise involves building random walks where the step is not normally distributed, but is t distributed. As the number of degrees of freedom approaches 1, the random walk becomes more and more dominated by rare explosive steps with very calm regions between. Perhaps that would be a good blog posting!
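
For what it's worth, such a walk takes only a few lines of R (the step count and the single degree of freedom are just illustrative choices):

   # a random walk with t-distributed (df = 1, i.e. Cauchy) steps,
   # compared to one with ordinary normal steps in red
   set.seed(1)
   n <- 1000
   plot(cumsum(rt(n, df=1)), type="l", ylab="position")
   lines(cumsum(rnorm(n)), col="red")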

steven said...

I know I'd read it.