Introduction to Statistics

Part 3: The Central Limit Theorem (CLT)

Welcome to part three, on the central limit theorem. Let’s get started.

Let’s recap the problem we’re trying to solve here. The problem: is the true mean height of sasquatches really 8 feet, when the last 10 random sightings give an average of 7.54?

We’ve learned so far that this difference between the sampled and the assumed means alone is not enough evidence to decide, because the sample’s difference could just be due to the variance inherent to the sasquatch population’s heights.

Because of sampling error, the logic of an answer becomes a little more complex.

So let’s finally list out the steps, then, of the logical (“inferential”) way we’ll need to answer this problem -- that is, how we’ll go about accepting or rejecting Woodland Pete’s claim that the true mean height of sasquatches is 8 feet.

  1. First, we’ll need to calculate the likely range of means that can be expected (for random samples) to arise out of the variance in the population alone. We know that, when taking a sample out of a population that varies, there’s a range of possible means the sample could have. But we need to know, precisely, what a likely and what an UNlikely mean is for a sample.
  2. Because, if our observed sample mean 7.54 is outside of that likely range…
  3. Then we would have logical evidence that says the true population mean height is not likely 8 feet after all.

When we model a population with certain assumptions, but then the laws of random sampling tell us that the population as assumed would be extremely unlikely to produce the real-world sample we observed, then we have reason to doubt the assumptions are true. Of course, this also assumes that our real-world sampling technique was good ( = random).

These “laws of random sampling” are what the CLT is all about, and will be the subject of this video. They help us complete step 1 (above), and infer, in advance, the range of means we’d likely see in a sample taken randomly from a population with an assumed mean and an assumed standard deviation.

As we start to explore the CLT, let’s establish a definition of what we mean when we say an outcome is “unlikely.”

One very common probabilistic definition, and the one we’ll adopt, is that something is unlikely if it has a less than 5%, or a 1-in-20, chance of occurring.

What does this look like on the normal model? Recall the Empirical Rule, which says that 95% of the population is WITHIN 2 standard deviations of the mean. (This leaves 5% of the population falling OUTSIDE of 2 standard deviations.)

So this leaves 2.5% of normal data above +2 standard deviations (having a Z-score greater than 2), and 2.5% below -2 standard deviations (a Z-score of less than -2), for a total of 5% of normal data outside of the +/- 2 range (i.e. being more extreme than 2 standard deviations from the center).

When we observe one individual datum, at random, from a normal distribution, then, it has only a combined 5% chance of being from one of these tails, that is, having an absolute Z-score greater than 2. And by our definition, then, any absolute Z-score greater than 2 is “unlikely.”
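To put numbers on those tails, here is a minimal Python sketch. The helper names `normal_cdf` and `two_tailed_probability` are made up for this illustration. It shows that the Empirical Rule's "5% outside of 2 standard deviations" is a rounded figure: the exact value is a little under 5%, and the cutoff that gives almost exactly 5% is a Z-score of about 1.96.

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_tailed_probability(z):
    """Chance that a normal datum has an absolute Z-score greater than |z|."""
    return 2.0 * (1.0 - normal_cdf(abs(z)))

# Both tails beyond 2 standard deviations: a bit under the rounded 5%.
print(round(two_tailed_probability(2.0), 4))
# Both tails beyond 1.96 standard deviations: almost exactly 5%.
print(round(two_tailed_probability(1.96), 4))
```

For the rest of this video, the rounded cutoff of 2 is close enough.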

Let’s take the example of flipping 100 coins again.

We know that when data has variance, a sample of it might not give the exact true mean. For example, flipping 100 coins might give 52 tails, but this doesn’t prove bias in a coin, because we know intuitively that this result isn’t that unlikely, given the random nature -- the variance -- in a fair coin.

But if we start to get more tails -- 60, 75 -- at what number precisely can we say that something other than randomness alone (like a true weight imbalance on the coin, a real bias in the coin) appears to be at work?

The CLT is what allows us to answer questions like these. As the “law of random sampling” (as we could call it), the CLT gives us the means we’re likely to see in a random sample taken from a population with a given mean and variance.

It says that: as n increases (as our samples get larger), this range of possible sample means (1) becomes normally distributed, even if the population’s original distribution is not normal, and (2) the sample mean will converge on the true mean.

This is perhaps THE most important idea in all of statistics. It’s what allows us to reach logical conclusions with only partial knowledge of the whole (with only a mere sample of a population).

To see how, to demonstrate the CLT, and show these TWO attributes at work, let’s perform a quick experiment.

What we’re really interested in here is the precise VARIANCE (the “error”) in sample means that’s possible, since samples of the same size from the same population can deliver a variety of different means.

So, if we’re interested in knowing how samples from the same fixed population can vary, why not repeat the same sample (of the same n) many times, chart all the sample means we get, and see what we get for variance between all of them?

We know that a larger n increases accuracy. So we just need to repeat the sample enough times, and the size of this “sample of samples” will eventually be large enough that we can be sure it’s giving pretty accurate results of how sample means truly vary (for a fixed n). We’ll be able to spot the mathematical pattern if there is one.

Let’s start with the smallest sample possible, sampling ONE coin flip at a time (n = 1). And let’s take 100 samples -- and just trust me when I say 100 will be enough “samples of samples” to see the CLT at work.

Let’s also go ahead and assign heads a value of 0 and tails a value of 1, to convert our flips into binary data that we can do math with.

Note that when we run a simulation of 100 samples of n = 1, because each sample has only 1 member, each sample will have a mean of its own value -- that is, of 0 or 1.

Now because a sample of n = 1 is equivalent to selecting a random individual from the population, you might be thinking “isn’t this [100 samples of 1 individual] the same as 1 sample of 100 individuals [flips]?” And you’re right, but we’re currently interested in seeing how MEANS vary between samples, so we’ve set up things like this.

So, running the simulation: how many means of 0 and means of 1 will we get out of 100 samples, and what will be the “mean of these 100 means?” And how will these 100 0s or 1s vary around this mean of means?

This is all better shown than explained. So here we go: we got 49 means of 0, and 51 means of 1. Overall, this averaged into 0.51. We’re calling this our mean of 100 means.

The variance of all these means is 0.2524. Recall how variance is calculated for a sample: it’s the “n - 1” average of the squared differences of all the 0s and 1s from 0.51 (that is, 49 copies of (0 - 0.51)^2 plus 51 copies of (1 - 0.51)^2, all divided by 99). This variance is the 100 means’ “average squared difference” from the mean of means, 0.51 -- which makes sense: 0 and 1 are both about 0.5 away, and 0.5^2 = 0.25.
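If you’d like to check that arithmetic yourself, here is a small Python sketch using the standard library’s `statistics` module. The list `means` is just a reconstruction of the result described above (49 zeros and 51 ones), not the actual simulation output from the video.

```python
import statistics

# Reconstruction of the n = 1 run: 49 heads (0) and 51 tails (1).
means = [0] * 49 + [1] * 51

mean_of_means = statistics.mean(means)
# statistics.variance is the sample variance, i.e. it divides by n - 1.
sample_variance = statistics.variance(means)

print(round(mean_of_means, 2))    # the mean of the 100 means
print(round(sample_variance, 4))  # the "n - 1" variance of the 100 means
```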

Why did we take the means of samples of only n = 1 first? Samples of 1, of course, are equivalent to sampling individuals, and taking the mean of one individual doesn’t do anything. So what we’ve actually obtained here is the DATA’S own variance -- the variance we can assume is inherent to the population itself. But, I wanted our process to be consistent -- we’ll bump up n to 10 next.

But this exercise gave us a baseline, so we can see what happens to the variance of sample means as we INCREASE sample size. This way, we can expose, and measure, the CLT’s effects.

Importantly too, note at this point how the sample data (assumed to be representative of the population) has a UNIFORM distribution (that is, every outcome (out of 2) has about equal chance of occurring). It’s NOT a normally distributed population.

Now, we’ll bump up the sample size to n = 10, and simulate 100 samples of 10 flips each. How do the sample means vary now?

Of course, with n > 1, the mean of each sample will NOT equal 1 or 0. For example, the first sample of n = 10 flips gave 4 tails, for a mean (proportion) of 0.4.

After 100 of these sample means were taken, the mean of all means was 0.486, which is very near a fair coin’s (the population’s) expected central tendency of 0.5.

  1. In the rightmost column of the truncated table, you can see each sample mean’s squared difference from this overall mean -- the n - 1 average of all these gives a variance of the sample means of 0.0247. (Recall that the variance in the coin itself was about 0.25, so this new variance is about one-tenth of the original variance in the population!)
  2. The variance (also known as error when referring to a distribution of sample means) has decreased because, when you average 10 flips (instead of just 1), you mostly get 3, 4, 5, 6, or 7 tails. In fact, 91 out of the 100 10-flip samples had a mean between 0.3 and 0.7, and 61 of the samples had a mean between 0.4 and 0.6.
  3. An error of ~0.025 converts to a standard deviation of ~0.16. IMPORTANTLY, note that this standard deviation (of a distribution of sample means) has a special name in this context. To keep it distinct from the standard deviation of data, and to designate it as the standard deviation of the MEANS of sampled data, it’s called standard error (SE).

This plot of our 100 samples (of 10 flips each) shows us that, when we take an average measure (of now n = 10 individuals), instead of a single, individual (n = 1) measure, the possible sample means begin to form a normal distribution around the true mean of 0.5. This is the CLT at work -- as n increases, random sampling produces more sample means close to the true central tendency, and fewer at the more extreme values.

  1. And, remember how I pointed out that the coin’s distribution was not normal, but uniform? EVEN THOUGH THE POPULATION’S ORIGINAL DISTRIBUTION IS NOT NORMAL, now that we’re averaging 10 individuals together, we’re already getting sample means roughly falling into a normal distribution, with tenfold less variance.
  2. But we’re interested in learning the general rule of how sample means vary, for any sample size (any n). So what happens when we increase our sample size to 100 flips per sample?

Here are the results of that simulation. Remember: out of our 100 samples of n = 10 flips per sample, 61 had a proportion of tails between 0.4 and 0.6. Now, in 100 samples of 100 flips, 98 had a proportion between 0.4 and 0.6.

You can see this in the variance, or error -- that is, the average squared difference of each 100-flip sample from the central tendency of (very nearly) 0.5. It’s now about a hundred times smaller than the original 0.25 variance of one individual coin flip!

With sample size n up tenfold (from n = 10), the variance of the sample means around the central tendency has dropped by another factor of ten.

You can see this visually in our plot on how tightly the sample means are converging (in a normal distribution) around the true central tendency.

Let’s summarize our coin flip experiment. For all sample sizes -- all n -- the true central tendency has shown through. The mean of means [for the 3 different sample sizes] has always been very near the true central tendency of 0.5. But: the variance of the sample means has been shrinking as n increases.

Our experiment has shown the CLT at work: that, as promised, as n increases, the range of samples’ means (1) becomes normally distributed, even if the population isn’t, and (2) converges around the population’s true mean (i.e. the error of the sample mean decreases).

But what’s the mathematical pattern? Looking at our summary table, the general rule is: as n increases, the variance of the sample means decreases, or divides, BY A FACTOR OF N. We kept increasing n by a factor of 10, so the variance of the sample means kept decreasing by a factor of 10.
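You can rerun a version of this experiment yourself. The Python sketch below is NOT the simulation used in the video: it’s a stand-in that uses 20,000 samples per sample size (instead of 100) so the pattern shows through with less noise, and the seed of 42 is arbitrary. The function name `variance_of_sample_means` is made up for this example.

```python
import random
import statistics

def variance_of_sample_means(n, num_samples, rng):
    """Flip a fair coin n times per sample; return the variance of the sample means."""
    means = []
    for _ in range(num_samples):
        flips = [rng.randint(0, 1) for _ in range(n)]  # heads = 0, tails = 1
        means.append(sum(flips) / n)
    return statistics.variance(means)  # "n - 1" variance of all the sample means

rng = random.Random(42)  # arbitrary seed, only so the run is repeatable
results = {}
for n in (1, 10, 100):
    results[n] = variance_of_sample_means(n, 20000, rng)
    # Compare the simulated variance of the means with 0.25 / n.
    print(n, round(results[n], 4), "expected about", 0.25 / n)
```

Each time n goes up tenfold, the simulated variance of the sample means should land close to one-tenth of the previous value.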

The numbers in our summary table have a little noise in them, because they are still sample data.

But, distilling from what we know (or assume) to be true about a fair coin, the central tendency (the overall mean) should always be 0.5, the original variance [which we measured as 0.2524] should be exactly 0.25, and so on down the line.

Of course, according to the CLT, if we had sampled more than 100 trials each time, our sample numbers would be closer to the expected numbers!

Rounding these values from our sample numbers to their ideal, expected numbers, we can derive several formulas.

  1. The expected population variance (denoted by σ² [“sigma-squared”]) is the variance found between individuals in the population (in this case, for single coin flips). For a fair coin with an expected center of 0.5, the population (individual) variance is 0.25.
  2. Because standard deviation is just the square root of this [of variance], the expected standard deviation of the population is 0.5.

Now, where n > 1:

  1. The expected error, that is, the expected variance of the distribution of sample means, is σ² (the original population variance) divided by n.
  2. Because standard deviation is always the square root of variance, the standard error (SE), that is, the standard deviation of the distribution of expected sample means, is the SQUARE ROOT of error. And error being the population variance divided by n, SE is thus the SQUARE ROOT of (σ² / n), which simplifies to σ (the population standard deviation) over the square root of n. [For example, 0.16 can be derived from 0.5 / sqrt(10).]
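As a quick check on these formulas, here is a short Python sketch. The function name `standard_error` is just an illustrative choice; the inputs are the fair-coin values from above (σ = 0.5).

```python
import math

def standard_error(sigma, n):
    """Standard deviation of the distribution of sample means: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 0.5  # population standard deviation of a fair coin (square root of 0.25)
for n in (1, 10, 100):
    se = standard_error(sigma, n)
    # Columns: n, expected error (variance of the sample means), standard error
    print(n, round(se ** 2, 4), round(se, 4))
```

For n = 10 this gives an SE of about 0.16, and for n = 100 an SE of 0.05, matching the rounded values from the coin experiment.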

What does 0.16 mean, for samples of 10 flips? Or 0.05 for samples of 100 [flips]? What is standard error? This is important. Taking samples of 100 flips each: the expected means (proportion of tails) of those samples will fall into a normal distribution with a standard deviation (called SE) of 0.05. And the Empirical Rule says that 68% of a normal distribution will fall within +/- 1 standard deviation of the mean. So for a normal distribution of sample means of 100 flips, 68% of the time (about two-thirds of the time) the sample mean (the proportion of tails) will be between 0.5 +/- 1 x 0.05 [the standard error], or 0.45 and 0.55. In summary, 68% of samples of 100 flips should give between 45 and 55 tails (a mean of 0.45 to 0.55).

Don’t worry if the idea of standard error is still a little fuzzy -- we’ll keep coming back to it as we come along.

One other thing to note: we’ve also happened to demonstrate that, for binary data, you don’t need to measure standard deviation separately -- it’s enough just to know the mean. Why?

The mean of binary data (the “proportion” p) already tells you its variance, which is p(1-p), because binary data can only vary between 0 and 1. So notice that there’s no σ or σ² in this formula [of the SE of a proportion] for standard error. If you want a deeper understanding of why this is the case, you can hang on at the end of this video and I’ll give a quick demo of the math.

The formulas we’ve demonstrated are very useful. With them, we can predict how the means of samples from a given population will DISTRIBUTE, given just that population’s own center, variance, and our sample size n. AND: we know that, even with non-normal population data, the means of samples from that data WILL be normal, with a variance equal to the population’s divided by a factor of n.

So, the upshot is: we can use the normal model -- Z-scores on the Standard Normal Distribution (SND) -- to find the likelihood of observing any range of sample means, given just the population’s parameters.

In summary, the normal sample means curve (the “SMC,” as we’ll call it) can be constructed from the population parameters alone. The SMC has a center μ [“mu”] (the population’s own expected center), and a standard deviation called standard error equal to σ over the square root of n -- that is, the population’s inherent standard deviation, divided by [the square root of] our sample size.

Now that we know we can use the normal model to “map” the SMC, we know we can use Z-scores to find the likelihood of any particular sample mean value [or range of values].

The only question is: how should we define an “unlikely” sample mean? Well, if

  1. We stick with our 5% probability definition from before, and...
  2. We know that the means of samples [comprising the SMC] are normally distributed, then...
  3. We say that the mean of a sample is “unlikely” if it’s more extreme than +/- 2 SEs away from the expected sample mean. That is, it has an absolute Z-score of > 2 on the SMC.

An equivalent way to handle this definition of “unlikely,” then, is as anything OUTSIDE OF a 95% likely range. That’s because every probability has its complement, which is 1 minus it. (Here, a 100% chance of everything, minus 5% unlikely -- as 2.5% unlikely to the right and 2.5% unlikely to the left -- leaves 95% likely.)

On the normal SMC then, the 95% likely range is everything WITHIN +/- 2 SEs. And “unlikely” means outside of +/- 2 SEs. That is, if a sample mean’s absolute Z-score is greater than 2, that means it’s from one of the 2.5% tails on the sample means’ normal curve -- and gets designated UNlikely.

The 95% likely ranges are easily calculated, then, for our different sample sizes. We just use this simple formula: we take the population’s expected center (μ) +/- 2 x SE. (“2” coming from the Empirical Rule, to give us the 95% likely range for ANY normal distribution; and standard error being particular to the [sample] means curve, the SMC, whose standard deviation is SE.)

And this last entry [ [0.4, 0.6], for n = 100] actually answers a question we’ve been asking: namely, out of 100 flips, what’s a suspicious (or unlikely) number of tails to get, where you might think that coin is biased?

Well, we did 0.5 +/- 2 x 0.05, to get a likely range of [0.4, 0.6]. Meaning: 95% of the time, in samples of 100 flips, you’d get a proportion of tails between 0.4 and 0.6. You’d get 40 to 60 tails, 95% of the time. So if you got something like 39 or less, or 61 or more, you have reason under this definition of unlikely to think that this coin is giving an unusual [“unlikely”] result. It’s giving a result from one of the 2.5% tails.
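Here is that “μ +/- 2 x SE” recipe as a small Python sketch. The function name `likely_range` and its parameter names are made up for this illustration; the inputs are the fair-coin values (μ = 0.5, σ = 0.5).

```python
import math

def likely_range(mu, sigma, n, z=2.0):
    """Range mu +/- z * SE, which holds about 95% of sample means when z = 2."""
    se = sigma / math.sqrt(n)  # standard error of the sample mean
    return (mu - z * se, mu + z * se)

for n in (10, 100):
    low, high = likely_range(0.5, 0.5, n)
    print(n, round(low, 3), round(high, 3))
```

For n = 100 this reproduces the [0.4, 0.6] range; for n = 10 the range is much wider, roughly 0.18 to 0.82.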

So, have we finally put this question to rest definitively? That if you have a coin which gives 39 tails in 100 flips, have we proved it’s biased once and for all?

Well, you can see we’re dealing with probabilities here. A result of 39 tails or less out of 100 flips still has a 2.5% chance of occurring in a truly, mathematically fair coin. And this (unfortunately) is as far as statistics will take us. It’s up to us, that is, to define the probability that’s finally small enough that we rule something (quote) definitively proven.

But let’s try a bonus question: what’s the 95% likely range of the sample mean for samples of 1000 flips? Pause the video and try it!

To solve that: μ +/- 2 times the SE → 0.5 +/- 2 x 0.5 / sqrt(1000) ~ [0.468, 0.532]. Meaning, out of 1000 flips, 95% of the time, you’d expect to see between 468 and 532 tails.

Finally, one last but very important note about this “likely” range: careful not to confuse it with a “confidence interval,” if you’ve already heard of those.

  1. This likely range is calculating the range of values we would expect to see for a SAMPLE MEAN, 95% of the time in repeated samples of the same size, IN SAMPLES FROM THE ASSUMED POPULATION. That is, assuming the center μ is the TRUE center of the data. That’s why we’ve adopted the population’s OWN μ [of 0.5] in constructing the 95% “likely” range.
  2. On the other hand, a confidence interval is very similar, but works from a different assumption. It does NOT assume the population center is the true center. Instead, it takes the SAMPLE’s mean (x̅, NOT μ) as the best estimate of the TRUE mean and then uses that as the center of an interval estimating the true mean.

Don’t worry if that’s a little fuzzy, we’ll come back to confidence intervals in the next video.

So it’s that time again, to go over the key points.

Remember our question from part 1, which was: given a population with variance, can we infer how the means of samples from that population will vary?

In this video, our coin experiment showed that, yes, thanks to the Central Limit Theorem, we can infer that. The variance of sample means will be the population’s variance reduced by a factor of n. And, the sample means distribution [curve] will be normal. So, we can use Z-scores to find the probability of any given sample mean.

And now, just for the love of more visuals, here’s the coin experiment data again, but with everything on the same scale.

Note how the n = 1 data (representative of the population distribution) is NOT NORMAL (it’s uniform). But the distributions of MEANS of samples where n > 1 ARE normal, and they’re narrowing as n increases. The expected means spike higher and higher at the central tendency.

We can use the statmagic app to see the CLT at work. In the normal distribution calculator, this time, choose x̅ [“x-bar”] mode, because we’re going to be looking at distributions of sample means.

From our coin example, μ was 0.5, standard deviation [σ] was 0.5, and let’s start with an n of 10. Just ignore the “Z vs. T” toggle for now. So first, let’s calculate the probability that a sample of 10 flips gives a proportion of tails between 0.4 and 0.6. Hit CALC, and there you go -- Z-score included.

But why are there two curves? The flat one in the background represents the population, if it were normal (but remember, it doesn’t have to be). The taller, narrower one in front is the SMC. The relative heights of these curves show the relative frequency, and hence also probability, of observing any value. So you can see that the probability of observing a value central to this population is higher when you take a sample of 10 flips and average them all, as opposed to taking a single flip.

Now let’s see what happens when we bump up the n to 100. Hit CALC again, and what’s happened? The SMC narrows towards the center even more, which says: the sample mean of an even bigger sample will be even more likely to give you a value near the population’s true mean. The probability of a sample mean falling between 0.4 and 0.6 is pretty large now (over 95%).

Bump n up to 1000, and I bet you can guess what will happen. Seeing a sample [mean] between these two bounds [0.4 and 0.6] is almost certain. In fact, the probability is rounding to an even 1.
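If you don’t have the app handy, the same kind of calculation can be sketched in a few lines of Python. This is only an approximation of what a normal-distribution calculator does in x̅ mode, not the app’s actual code, and the helper names `normal_cdf` and `prob_sample_mean_between` are made up for this example.

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution up to z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def prob_sample_mean_between(low, high, mu, sigma, n):
    """Normal-model probability that a sample mean lands between low and high."""
    se = sigma / math.sqrt(n)   # standard error of the sample mean
    z_low = (low - mu) / se     # Z-score of the lower bound on the SMC
    z_high = (high - mu) / se   # Z-score of the upper bound on the SMC
    return normal_cdf(z_high) - normal_cdf(z_low)

for n in (10, 100, 1000):
    p = prob_sample_mean_between(0.4, 0.6, 0.5, 0.5, n)
    print(n, round(p, 4))
```

The probability comes out a little under a half for n = 10, just over 95% for n = 100, and so close to 1 for n = 1000 that it rounds to 1.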

And this is the CLT, visually. Your odds that a sample’s mean will reflect the population’s true mean increase with sample size.

To get back to the main question of this series: how does what we’ve learned about the CLT apply to finding the true mean height of sasquatches?

Well, it helps us to calculate how likely a sample mean of 7.54 feet was, assuming the mean population height really was 8 feet.

So, to finally bring it all together and answer this one, check out the next video, where we’ll walk through hypothesis testing and practice some fuller logic of this type of inferential statistics.

Finally, as promised, here’s the math showing how the variance of binary data (the variance of “proportional” data) simplifies down to p(1-p).

First, here’s the general form of variance: it’s the average squared difference of every data point and its mean.

But binary data is not continuous: it only comes as a 0 or 1. So we can rewrite this “average squared difference” [variance] formula as so: where you have the number of 0s there are [n0] times the squared difference between 0 and the center p, plus the number of 1s there are, times the squared difference between 1 and the center p, all over the total number of observations (the total number of 0s plus the number of 1s).

Now secondly, we know that p is simply the number of 1s (or “successes”) there are, over the total n [hence p = n1 / (n0 + n1)]. And then 1-p is just the number of 0s [“failures”] there are, over the total n [since p + (1 - p) = 1] [Hence 1 - p = n0 / (n0 + n1)].

Throwing these identities for p and 1 - p back into the [variance] equation, we get this [see slide]. And consolidating it, we get this, and consolidating some more, and some more [see slide for the mathematical reasoning here], and then recognizing what’s there, we have these same identities again [of p = n1 / (n0 + n1) TIMES 1 - p = n0 / (n0 + n1)], which has simplified down to p(1-p), which equals the variance for binary data.
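Since the slide isn’t reproduced here, this is one way that consolidation can be written out. It may not match the slide’s steps exactly; it uses the same symbols as above (n0 for the number of 0s, n1 for the number of 1s, and p for the proportion of 1s).

```latex
\begin{aligned}
\sigma^2 &= \frac{n_0 (0 - p)^2 + n_1 (1 - p)^2}{n_0 + n_1} \\
         &= \frac{n_0}{n_0 + n_1}\, p^2 + \frac{n_1}{n_0 + n_1}\, (1 - p)^2 \\
         &= (1 - p)\, p^2 + p\, (1 - p)^2 \\
         &= p (1 - p) \bigl[ p + (1 - p) \bigr] \\
         &= p (1 - p)
\end{aligned}
```

The third line is where the identities p = n1 / (n0 + n1) and 1 - p = n0 / (n0 + n1) get substituted in, and the last step works because p + (1 - p) = 1.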

So if we go back and look where we discussed the formula for standard error (standard error being the population standard deviation over the square root of n): if binary data’s variance is p(1-p), [then] its standard deviation is the square root of that [= sqrt(p(1-p))]. So the standard error of binary data is: the square root of (p(1-p) / n).
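To see both results numerically, here is a short Python sketch. The function name `proportion_standard_error` is made up for this example, and the data list is a reconstruction of the 49 heads and 51 tails from the n = 1 run earlier. Note that it uses `statistics.pvariance`, which divides by n (as the derivation does), rather than the “n - 1” sample variance.

```python
import math
import statistics

def proportion_standard_error(p, n):
    """Standard error of a proportion: sqrt(p * (1 - p) / n)."""
    return math.sqrt(p * (1.0 - p) / n)

# Check p(1-p) against the directly computed variance of 0/1 data.
data = [0] * 49 + [1] * 51
p = sum(data) / len(data)
direct_variance = statistics.pvariance(data)  # divides by n, not n - 1
print(round(p * (1 - p), 4), round(direct_variance, 4))

# Standard error for a fair coin (p = 0.5) at the sample sizes used above.
for n in (10, 100):
    print(n, round(proportion_standard_error(0.5, n), 4))
```

The two variance figures agree, and the standard errors match the σ over square root of n values (about 0.16 and 0.05) from the coin experiment.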