In this video, we explain:
(Click on a heading to navigate the video to that section)
And now, to bring it all together, I give you part 4, on hypothesis testing.
Let’s recap our problem. Is the true mean height of sasquatches really 8 feet, when our sample of 10 random sightings gave an average of 7.54 feet? As we’ve covered before, this difference by itself is not enough to say one way or the other -- because the difference could just be due to the variance within the sasquatch population’s heights. (It could be due to “sampling error.”) So to solve this problem, we first need to: (1) calculate the likely range of means that could be expected to arise in a random sample from the variance in the population alone.
In part 3, with our exploration of the CLT, we learned the precise mathematical relationship between a population’s variance, and the variance, then, in that population’s sample means curve -- that is, in the normal distribution of the means you’d expect to see when taking random samples of a fixed size (n) from the population. We saw how it worked for coin flips, and now it’s time to apply it to our height data.
The entire 3-step logic here (which we’ve seen before) is a simplification of the more formal and precise logic of statistical hypothesis testing, which we’ll now cover in part 4. The basic method is: if the mean of 7.54 we randomly sampled was unlikely enough under the assumption of an 8-foot mean -- unlikely enough according to a predetermined probability threshold -- that the assumption no longer seems tenable, we can reject the idea that the true mean height is 8 feet. So, hypothesis testing is basically a process of giving a claim the benefit of the doubt (assuming it’s true), before seeing if inference from our sample data can rule that claim out.
Before diving in, it’s worth taking a closer look at one point in particular that’s easy to miss. What we are up to is called “parametric” testing, which means: statistical testing that “takes parameters.” That is, our analysis proceeds through data summary measures -- parameters -- of mean and standard deviation, which we use to build a model we can ultimately calculate probabilities from.
But you might have noticed already, in slowly learning the pieces of this method, that there are several symbols floating around for mean and standard deviation (there’s s, σ, μ, x̅), AND 4 distinct distributions which take these different parameters. So it’s crucial to keep them organized in your head.
So let’s recap. These four distributions and their parameters are:
The SMC is of course derived from the population model, and itself models the means of samples taken from the population.
Note that the normal model (and Z-scores) can be applied to any of these distributions IF they are normal. (But, the first three distributions don’t necessarily HAVE TO BE.)
However: the CLT (as we saw) has it that the SMC WILL BE normal (so: number 4 is ALWAYS normal), and this is where we’ll use Z-scores in hypothesis testing. We’ll use them to find the normal probability that a particular sample mean would arise from the assumed population parameters.
Here’s an illustrated guide to how these 4 different distributions are used in parametric testing.
According to the CLT:
In written steps, here’s the full method for a hypothesis of mean. It might look like a lot, but once we go over each step, it’s not too laborious. Especially, as we’ll see, with technology like the statmagic app to help!
I put the list here for reference. Let’s start with step 1.
So far, we’ve had a working definition of “unlikely” as having < 5% probability. And, along with this, we’ve said there’s a 95% “likely range,” with the leftover “unlikely” probability split evenly into two, 2.5% tails. We’ve also said the Z-score marking the division is +/- 2.
Well, in hypothesis testing, this concept of “unlikely,” converted to a probability threshold, has a formal name: alpha [α]. And, on a normal distribution, 5% in two tails has a more precise cutoff: +/- 1.96.
Alpha, as a capture of what we think an “unlikely” probability is, can thus serve as a “threshold of rejection.” That is, it’s the probability that we judge small enough to rule OUT the idea that chance is acting alone within our modeled assumptions, and therefore it’s what we use to rule IN the idea that our assumption is wrong. The 0.05 we’ve been using is conventional for this.
So: if an observed sample mean (x̅) has a Z-score on the SMC greater than +/- 1.96 (that is, the x̅ is in one of these tails), this means that its population with its assumed center μ had a < 5% chance of furnishing that mean in a random sample. Therefore, if 0.05 was our threshold of rejection (our alpha), we’d logically reject the idea that the population’s assumed center is true after all.
But: this threshold of rejection can be changed to a higher or lower standard. For example, if alpha is changed to (a higher standard of) 1% (again, as a total probability for two extremes, or tails, leaving a 99% likely range), this would then require a sample mean to have a Z-score (on the SMC) greater than +/- 2.58.
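If you have Python handy, you can confirm these cutoffs yourself. Here’s a minimal sketch using the standard library’s inverse normal CDF (the function name `z_critical` is just my own label for it; in the video’s workflow, statmagic handles this):

```python
from statistics import NormalDist

def z_critical(alpha, tails=2):
    """Z-score marking the rejection boundary for a given alpha.

    Two-tailed tests split alpha across both tails;
    one-tailed tests put all of alpha in a single tail.
    """
    return NormalDist().inv_cdf(1 - alpha / tails)

print(round(z_critical(0.05), 2))  # 1.96 -- the familiar 5%, 2-tailed cutoff
print(round(z_critical(0.01), 2))  # 2.58 -- the stricter 1%, 2-tailed standard
```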
Why change the standard? Why use a different alpha? That depends on the question asked, and the burden of proof that seems appropriate for the question asked.
This leads us into step 2 in hypothesis testing, which is: determining the hypotheses and tails of the test. Here, after we’ve decided alpha (our overall probability threshold), we next need to spell out the precise event we want to measure the probability of, and compare that to alpha.
So that involves examining our original question a bit more carefully. Because: did we want to know if the true mean height of sasquatches is:
As I’ll show shortly, these three different, more precise questions change the hypotheses and probabilities involved. So normally, we need to parse our question into one of these 3 more precise options BEFORE looking at our data.
Of the three, “different from” (choice a) is the most conservative and robust, and it’s what we should use since we already looked at the data and noticed our sample mean is < the assumed mean. But even had we not looked, this would be our choice, since we had no prior reason to suspect that our sample’s mean would ONLY be greater or ONLY be lesser than the 8-foot assumption.
This precise type of question-forming has a standard logical form in hypothesis testing. It’s called formulating the null and alternative hypotheses (Ho and Ha).
In our question, the null hypothesis is that μ (the true population mean height) equals 8. Ho, that is, states the assumption: the value region that you want to test, rule out, or reject.
Together, Ho and Ha have to be “complete,” and cover the full logical ground of what μ could be. Because, if you disprove something, you’ve logically proven its complement. If you’ve ruled out Ho, you’ve logically ruled in the complement Ha. So in our example, Ha would be that μ does NOT equal 8.
So again, we have three options. The first one applies to our example, with Ho of μ = 8, which implies an Ha of μ does NOT equal 8. This combination of hypotheses is known as a “2-tailed” test, because a significant sample mean to either side of 8 would disprove the null hypothesis.
In contrast, “1-tailed tests” only test to one side of the mean, leaving a rejection region (and all of alpha) in one tail only.
So changing the precise question asked -- changing the hypotheses of a test -- changes its tails. Let’s try to get a visual of what this means. Say our alpha is 0.01, and let’s compare 2 cases.
Visually, then: changing from 2 down to 1 tail, holding our alpha constant, changes the location of the “likely” and “unlikely” ranges on the SMC. That is, a 1-tailed test shifts all of the “unlikely range” into 1 tail, since a sample mean in the opposite tail no longer disproves our null hypothesis Ho. The question we’re asking now [in the 1/right-tailed case] is only if the true mean is greater than assumed. So, to see a sample mean signaling the opposite over here [in the left tail of a right-tailed test] doesn’t pertain to our question anymore. (It’s the hypotheses that we’ve set up that way.)
So, a 1-tailed test halves the number of outcomes which disprove Ho from 2 to 1. But it doesn’t change our total alpha (our total “disproving” probability). So alpha no longer needs to be split. Now only one outcome -- a sample mean GREATER THAN μ -- can disprove Ho, so it gets a lower burden of proof: a test score of 2.33 marks a 1% tail, instead of two 2.58 Z-scores (both marking 0.5% tails).
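To see the 2.58-vs-2.33 difference numerically, here’s a quick sketch (Python standard library only) comparing the two cutoffs at the same alpha of 0.01:

```python
from statistics import NormalDist

alpha = 0.01
nd = NormalDist()

# 2-tailed: alpha is split, so each tail holds only 0.5% of probability.
z_two_tailed = nd.inv_cdf(1 - alpha / 2)
# 1-tailed: all of alpha sits in a single 1% tail -- a lower cutoff.
z_one_tailed = nd.inv_cdf(1 - alpha)

print(round(z_two_tailed, 2))  # 2.58
print(round(z_one_tailed, 2))  # 2.33
```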
This is why, methodologically, it’s important to choose a 1 or 2-tailed test BEFORE peeking at the data. Changing to a 1/R-tailed test AFTER noticing the sample mean is greater than μ changes the conditional probability here, and makes for a biased test, given the lower burden of proof.
In a stat class, you might be asked to determine, or choose, the tails of the test from a written question. I admit, it can be tricky to translate a question written in English into one of the 3 Ho-Ha pairs. My tip is: always default to 2-tailed, unless there are keywords like “greater than” or “less than” that clue you in to a 1-tailed test.
With alpha and the hypotheses set, we can put together our population model, with its assumed parameters. First, we set the model’s center to μ.
Next, we need to set the population model’s standard deviation parameter, σ. As we’ve discussed, when we know σ, or have a number we’re willing to assume is the population’s standard deviation, we use that number. But, when σ is NOT known (which is most cases), we have to estimate it from the sample with s.
However, this step of estimation introduces a wrinkle into our parametric test that we need to address. That is: just as x̅ improves as an estimate of the true mean μ when sample size increases, so does s as an estimate of σ.
So, when using s as an estimate of σ, we need to check n, to see how good an estimate s is. And if n is too small (conventionally, < 30), we can’t be sure that s is a good estimate of σ. So in those cases where n is small (< 30), we need to use what’s called a T-test, instead of the Z-test we’ve been using so far.
So what is a T-test, versus a Z-test? These are both types of hypothesis test of a mean, but they differ in the type of distribution we use to model the sample means curve (SMC). A T-test uses the T-normal distribution for the SMC, whereas a Z-test uses the Z-normal. Thus, a sample mean’s position on the T-normal is a T-score. This is just like the Z, except a T-test has something extra called degrees of freedom.
This degrees of freedom parameter of a T-normal distribution gives it “fatter tails,” which helps control for the estimation error of using s to estimate σ. Hence a T-test is a more conservative test. But as n increases, the T distribution starts to approximate the Z. (You’re going to have to take my word for all that, because we’re not going to dive into the math at this point.) But the takeaway is:
Where n >= 30, we can be sure that we’ve sampled enough that s is fairly representative of σ. This switch to Z at n >= 30 is conventional. (Compare the critical test values [which we’ll explore in more depth shortly] for n = 30, alpha = 0.05, 2-tailed: T*(0.05/2, 29 df) = 2.045 and Z*(0.05/2) [for the same spot on the SMC] = 1.96.)
So, small sample sizes mean T-tests, and T-tests mean “degrees of freedom.” Now you might be wondering: what is a “degree of freedom”? It comes from the fact that we’re estimating variance from another estimate -- the sample’s mean. That is, x̅ is PART OF THE FORMULA for calculating variance.
It’s a subtle topic, but you can think of degrees of freedom (df) like this: say you have 3 numbers that must add up to 10. You can pick any number for the first 2, but once you have, the third is predetermined. So this would be a situation with n - 1 = 3 - 1 = 2 df.
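That “must add up to 10” idea is small enough to sketch directly in code:

```python
# Three numbers constrained to sum to 10: choose the first two freely,
# and the third is forced. Only n - 1 = 2 values are truly "free."
total = 10
free_choices = [3, 4]               # pick ANY two numbers you like
forced = total - sum(free_choices)  # no choice left: must be 3

print(forced)             # 3
print(len(free_choices))  # 2 degrees of freedom
```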
Technical points aside, let’s focus on how we know when to use a Z- vs. a T-test. This actually hinges on just one question: do we have a good estimate for the population’s standard deviation?
Let’s take our height data for an example. We don’t know σ (we don’t know the true standard deviation of sasquatch heights), and n = 10 (which is < 30). So these 2 NOs say we have to use a T-test in our sasquatch heights example.
One last thing to note here is that a test of binary data can always use a Z test, since (as we saw in part 3), the variance of binary data is already known (it’s already “hard-coded”) in its proportion of 1s. However, a Z test of a proportion is more reliable when np and n(1-p) are both > 5.
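As a quick sketch of that reliability check (the numbers here are made up for illustration, and the function name is my own):

```python
def proportion_z_test_reliable(n, p):
    """Rule of thumb: a Z-test of a proportion is more reliable
    when both np and n(1 - p) exceed 5."""
    return n * p > 5 and n * (1 - p) > 5

print(proportion_z_test_reliable(100, 0.5))  # True: 50 and 50 both exceed 5
print(proportion_z_test_reliable(40, 0.1))   # False: only 4 expected "1"s
```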
The fourth step in parametric hypothesis testing is to find the critical test value. And this is something we’ve actually already seen, in our discussion of alpha and tails. The critical test value (notated with an asterisk) is the Z- or T-score (or scores, in 2-tailed tests) that marks where alpha is on the SMC (and therefore where the “unlikely vs. likely” ranges are on the sample means curve).
Therefore, a sample mean with a Z or T-score MORE EXTREME on an SMC than a critical value puts that sample mean in the “tail,” or “rejection region” and outside of the “likely range.” This means, it has a probability < alpha (or < alpha / 2 for 2-tailed tests). For example, if our sampled mean sasquatch height, 7.54 feet, ends up having a T-score greater than the absolute T-critical value, this would indicate that 7.54 is an “unlikely” mean to sample, at least under our modeled assumptions.
So, a Z-critical value depends on two things: alpha and tails. And a T-critical (just like a T-score) depends on these, AND the degrees of freedom.
So here are our illustrations of sample means curves again, where alpha equals 0.01. You can see the Z critical values marking half of alpha in each tail for a 2-tailed test, giving you a Z* value of +/- 2.58. And you can see how that changes with tails: going to a 1-tailed test, where all of alpha is in 1 tail, means your critical value is ONLY (POSITIVE) +2.33.
For the very common alpha of 0.05, for a 2-tailed test, your Z* would be +/- 1.96. And for a 1-tailed test, about 1.65 for the Z* (positive or negative, depending on the direction of the test). For any T*, you need to incorporate the degrees of freedom. So to find the critical value for any alpha, tails, and Z or T combination, you’d need to reference a Z or T table, or use a technology like the statmagic app!
Now, it’s time to calculate the test score. This is the T- or Z-score, then, of the observed sample mean on the SMC derived from the modeled population. So let’s rehash the logic of this (this is the logic of hypothesis testing).
We want to find the test score of the sample mean we observed (where it lies on the sample means curve we’re assuming to be true, if the population is as modeled -- if its center really is truly μ). We want to find the test score so that we can find the probability of observing the sample mean that we did. So let’s see this in steps:
Remember, for a T-test, we also need to obtain the T-score’s degrees of freedom. For our sample of 10, we have 10 - 1 = 9 df.
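Here’s a sketch of that test-score calculation in Python. One big caveat: the video never states the sample standard deviation directly, so the s below (about 0.6204 feet) is back-solved from the T-score reported in the walkthrough, purely for illustration:

```python
import math

# x_bar, mu, and n come from the problem; s is an ASSUMED value,
# back-solved from the reported T-score (the video doesn't state it).
x_bar, mu, s, n = 7.54, 8.0, 0.6204, 10

standard_error = s / math.sqrt(n)        # estimated sigma, scaled by sqrt(n)
t_score = (x_bar - mu) / standard_error  # position of x_bar on the SMC
df = n - 1                               # degrees of freedom for the T-test

print(round(t_score, 4), df)  # -2.3447 9
```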
So what do we do with this score? We use it to infer the probability of seeing the sample mean we did, IF the true population mean is as assumed. In other words, just how likely or unlikely were we to sample a mean of 7.54 feet in a sample of 10 sasquatches, if their true mean height really is 8 feet after all?
Once we have a T-score, we know that it has a set probability. But to reach a conclusion with our test, we need to turn this probability into a “yes-or-no.” So step 6 is to evaluate this test score and its probability against alpha, and decide to accept or reject the null hypothesis. To see if we can reject Ho or not, we have two equivalent options. We can:
If the test score of the sample mean is more extreme than the critical value, or its p-value is less than alpha, we can reject Ho.
What does it mean to reject the null hypothesis? We’re saying that -- based on the alpha threshold we set up in advance (plus our faith in the Central Limit Theorem working) -- the sample mean we observed was so unlikely to arise in a random sample of the population AS ASSUMED (if it really did have a center μ), that we now reject the idea that its center is μ. We’re rejecting the idea that μ is the true center of the population.
And since the alternative hypothesis Ha is the logical complement of Ho, when we reject the null hypothesis, we [must accept the alternative and] then conclude that “there is statistical evidence” for Ha. We embrace the alternative hypothesis -- that the true mean either does not equal, is less than, or is greater than, μ (depending on the hypotheses we set up).
Here’s a quick visualization of rejecting Ho by scores comparison.
In this hypothetical test, we had an alpha of 0.05, and ran a Z-test. An alpha of 0.05, 2-tailed, corresponds to a critical Z-value of +/- 1.96. And, our sample mean had a Z-score of 2.13, which was more extreme than the positive critical value. Note that a negative 2.13 (the symmetric result) would have also rejected Ho.
With just scores comparison, we know that getting a test score > an absolute critical value means that the probability of getting that sample mean was < alpha -- but that’s all the resolution we get.
However, if we have the means (like if we have the statmagic app) to calculate the exact probability, or p-value, of the sample mean’s test score, we can do that, and compare it to alpha directly (to see if we reject Ho).
The benefit of this method is knowing the strength of the evidence -- because the smaller the p-value, the more conclusive the evidence.
Just compare a test with a p-value that says the observed sample mean had a 1-in-20 chance of occurring randomly under the assumptions, versus a p-value that says it was more like a 1-in-1000 chance. The second case is much more conclusive evidence.
Here’s our same hypothetical test as before (with an alpha of 0.05 and a Z-score of 2.13). It turns out that a Z-score of 2.13 for a 2-tailed test has a p-value of 0.033 -- about a 1-in-30 chance of randomly sampling a mean this extreme.
The last tricky thing to remember here is that, in 2-tailed tests, we have to “double” our p-values. That is, the probability of a sample mean having a Z-score of +2.13 or greater is 1.66%. But in a 2-tailed test, remember that we are determining the odds that a sample, taken randomly from the modeled population, would have a mean that is DIFFERENT FROM μ -- NOT JUST THAT MUCH GREATER. So we need to add in the probability of the symmetric result (of the equal but negative Z score) -- and this doubles the p-value in a 2-tailed test.
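A quick sketch of that doubling, using the standard library’s normal CDF:

```python
from statistics import NormalDist

z = 2.13
upper_tail = 1 - NormalDist().cdf(z)  # P(Z >= 2.13): the 1.66% upper tail
p_two_tailed = 2 * upper_tail         # add the symmetric lower tail

print(round(upper_tail, 4))    # 0.0166
print(round(p_two_tailed, 3))  # 0.033
```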
In this test, the p-value is < alpha, so we would reject Ho (that μ is equal to some value), and conclude by accepting the alternative hypothesis Ha (= that μ is simply NOT equal to that value, seeing how this is a 2-tailed test). This is the end of a typical hypothesis test. But there’s one more very useful thing we can do with our sample data.
Say you’ve run a test, rejected Ho, and concluded then that the assumed population mean is not correct. So, you might wonder, what IS the true population mean then? IT’S NOT x̅, the sample’s mean. Remember, when an Ho is rejected, the Ha you accept in turn only tells you it’s either not equal to, less than, or greater than assumed. It does not tell you exactly what the true mean really is!
However: you can use the sample data to construct what’s called a confidence interval (“CI”). From the sample data, a CI infers the interval we can be x% sure, or confident, contains the true population mean.
Confidence intervals use the observed sample mean to estimate the true mean, instead of assuming some population mean is true.
Here are the Z and T versions of a confidence interval. With a CI, you’re NOT assuming you know the population’s mean; instead, you’re using the x̅ (the observed sample mean) as the center of an interval for your best estimate of the true mean. You then use a critical value [corresponding to the desired level of confidence] and the standard error, to construct a normal interval around this point estimate x̅. For example, if you wanted a 95% confidence interval, you would use 1.96 for the critical value where a Z-distribution is appropriate (under these conditions [σ is known or n >= 30]).
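Here’s a minimal sketch of the Z version of the interval. The sample numbers (mean 7.6 feet, s = 0.6, n = 36) are hypothetical -- they’re not the sasquatch data, which with n = 10 would call for the T version instead:

```python
import math
from statistics import NormalDist

def z_confidence_interval(x_bar, s, n, confidence=0.95):
    """x_bar +/- z* * s/sqrt(n); appropriate when sigma is known or n >= 30."""
    alpha = 1 - confidence
    z_star = NormalDist().inv_cdf(1 - alpha / 2)  # 1.96 for 95% confidence
    margin = z_star * s / math.sqrt(n)
    return (x_bar - margin, x_bar + margin)

lo, hi = z_confidence_interval(x_bar=7.6, s=0.6, n=36)
print(round(lo, 3), round(hi, 3))  # 7.404 7.796
```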
How does this work? How can we be 95% confident this interval (centered around our sample’s mean) will contain the true mean? Think about our coin flip experiment from part 3, on the CLT.
For a random sample of 100 flips, we found a standard error of 5 tails, which told us that we are 95% likely to see between 40 and 60 tails when the TRUE central tendency is 50. (We got this from: μ +/- 2 * standard error (SE) = 50 +/- 2 * 5.) In other words, only 5% of our sample means in the long run are expected to be outside of this “likely range” of +/- 10 tails (of 39 or less, OR 61 or more).
But, what if, for every random sample of 100 flips, you took its mean (x̅) as the best estimate of the true mean, and constructed an interval of whatever x̅ was, +/- that same 2*SE, that is, +/- 10?
With the simple math here, you can readily see that ONLY those intervals with x̅ centers of 39 or less, or 61 or more, would NOT contain the true mean (of 50 tails out of 100 flips). That is: 39 +/- 10 = [29,49] and 61 +/- 10 = [51, 71]. And these means, crucially as we saw, ONLY HAPPEN 5% OF THE TIME when the true mean is 0.5 (50 tails out of 100 flips).
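You can check this coverage claim with a quick simulation sketch (the seed and trial count are arbitrary choices of mine):

```python
import random

random.seed(42)
n_flips, trials = 100, 10_000
se = 5  # standard error of the tail COUNT: sqrt(100 * 0.5 * 0.5) = 5

covered = 0
for _ in range(trials):
    tails = sum(random.random() < 0.5 for _ in range(n_flips))
    # Interval centered on the OBSERVED count: tails +/- 2*SE = tails +/- 10.
    if tails - 2 * se <= 50 <= tails + 2 * se:
        covered += 1

# Close to 0.95 (a bit above, since whole-number counts make the
# +/- 10 interval slightly generous).
print(covered / trials)
```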
One final note on hypothesis testing and confidence intervals. A 2-tailed hypothesis test at some alpha, and a (1 - alpha) CI, will always “agree.” That is: if the hypothesis test rejected the assumed μ as the true population mean, then the CI will NOT contain the assumed μ. On the other hand, if the test failed to reject, the CI WILL contain μ. So whether or not the interval contains μ can be another way of accepting or rejecting Ho.
A couple of graphics should help clarify this point:
Let’s finally solve the sasquatch problem completely. We’ll use the statmagic app to help us visualize the population model, and to take care of all the calculations. We can walk through it using the 6 steps we’ve laid out:
Note how the graphic models the population behind its SMC. As we’ve covered before, these two curves’ relative peakedness around the center shows how strongly the population’s central tendency should come out in the SMC -- that is, how narrow a range the sample mean should obey -- for any random sample of the entered sample size n. For example, this is what it would look like if our sample had been of 100 sasquatches [the SMC narrows more].
Also note how there are two symmetric lines on the SMC: one marking x̅, and the other marking x̅’s twin extreme. This is because this is a 2-tailed test, so we want to find the combined probability of observing a mean AT LEAST AS EXTREME AS x̅ -- that is, to EITHER SIDE OF the assumed mean. So we want the combined area in the two tails marked by these two lines.
Statmagic will do the rest of the steps for us:
Bonus: calculate a confidence interval for the true mean.
Just hit CALC, and expand to see all the results. We have a significant test here, rejecting Woodland Pete’s authoritative claim, with p-value of 0.0437, less than our alpha of 0.05. The T-score for a sample mean of 7.54 feet was -2.3447, beyond the critical T* of +/-2.2622 for 9 df.
Note how the confidence interval (at the bottom) does NOT contain μ (it does not contain 8). The upper bound is just short of 8. It tells us that we can be 95% confident that the true mean height of sasquatches is not 8 feet, but instead lies between about 7.10 and 7.98 feet.
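If you want to double-check the app’s p-value by hand, here’s a rough sketch that integrates the T distribution’s tail numerically (a stand-in for a T table; the -2.3447 score and 9 df are the values from the walkthrough above):

```python
import math

def t_pdf(t, df):
    """Density of Student's T distribution -- the 'fatter-tailed' curve."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t_score, df, upper=60.0, steps=50_000):
    """2 * P(T >= |t|), via midpoint-rule integration of the upper tail."""
    a = abs(t_score)
    h = (upper - a) / steps
    tail_area = sum(t_pdf(a + (i + 0.5) * h, df) for i in range(steps)) * h
    return 2 * tail_area

print(round(two_tailed_p(-2.3447, 9), 4))  # 0.0437, matching the app
```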
And finally, the key points from this part.
We’ve learned that challenging an assumed mean when we have only partial data (only a sample) -- such as in the question of the true mean height of sasquatches, which animated this series -- is what parametric hypothesis testing is all about.
And we saw how hypothesis testing makes use of everything we learned in the first three parts:
Thus we can use the fact that the SMC is normal, with a center μ and a standard error of σ / sqrt(n), to infer the probability of observing the sample mean that we did.
And if it was unlikely enough (as predetermined with alpha), we then have statistical evidence to reject the assumed mean μ as the true mean.
Lastly, we learned that, where hypothesis testing can only tell us if the true mean appears equal to, greater than, or less than assumed (depending on the null hypothesis); a confidence interval can give us an estimated range of the true mean using only sample data.