Introduction to Statistics

Part 1: Measuring Data

VIDEO TRANSCRIPT

Welcome to part one of this introduction to statistics on measuring data. Let’s jump right in.

Let’s start with the problem.

Say you want to know: How tall are sasquatches?

The most exhaustive way to answer this question would be to measure every sasquatch, but you can’t do that. You can’t measure the sasquatch population in its entirety. Given how elusive they are, you would never know if you had measured them all.

So you do some research instead. The most authoritative figure you find on the subject is 8 foot even, which comes from Woodland Pete in a 1908 editorial to the Medford Mail-Tribune. But, not to just take his authority on this, you do some more digging, and you find that the last 10 random sightings of sasquatches give a mean (arithmetic average) of 7.5 feet.

So, we have conflicting numbers. What do we do?

Our question becomes: Is this latest sample data enough evidence to reject and revise Woodland Pete’s claim? This is a perfect example of the kind of problem that statistics allows us to solve: it helps us get strong conclusions out of partial knowledge.

Deciding whether we should defer to Woodland Pete’s authority or revise it will depend on analyzing a sample, which is, of course, partial knowledge of a population.

This question is going to animate our series, so first things first: let’s take a look at the raw sample data that you found.

This is a table of just facts about random individuals. So for our question, this isn’t useful yet, because we want to know a fact about the population.

We need to decide on a summary measure, that is, a descriptive statistic, which will capture what we want to know: the fact of the population’s average height.

Before we go ahead and start crunching numbers, it’s important to consider for a second which measure of data we ought to use. How can we best translate what we want to know about a population into a number?

We’ve said we want to know the average height of a sasquatch, but by average what do we mean?

We should be aware that there are multiple measures of average. We’ve implied that we are using the most familiar of these: the arithmetic mean, where you add up all the values and divide by their count. But there are other averages (average meaning a measure of central tendency).

I mentioned the mean. There’s also the median, which is the center value once you order all the values by magnitude. There’s proportion, which is just the mean of binary data (a dataset that’s all 1’s and 0’s). There are others: the mode, or the most frequently occurring value; the geometric mean, which is the nth root of the product of all n values; and so on.
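To make these measures concrete, here is a minimal sketch using Python’s standard `statistics` module. The numbers are hypothetical and chosen only so the arithmetic is easy to follow; they are not data from this video.

```python
import statistics

# Hypothetical values, for illustration only (not data from the video).
values = [2, 4, 4, 8]
binary = [1, 0, 1, 1]  # 1 = "yes", 0 = "no"

mean = statistics.mean(values)                # add up the values, divide by their count
median = statistics.median(values)            # center of the sorted values
mode = statistics.mode(values)                # most frequently occurring value
geo_mean = statistics.geometric_mean(values)  # 4th root of 2 * 4 * 4 * 8
proportion = statistics.mean(binary)          # mean of 1's and 0's = share of 1's

print(mean, median, mode, round(geo_mean, 4), proportion)
```

Notice that the same four values give different "averages" depending on which measure you pick, which is why the choice is a modeling decision.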

The important note here is to be aware that already we are making a decision on how we are going to mathematically model the fact of the population that we want to find out.

We should note that each of these measures of central tendency comes with possible bias. Arithmetic mean for instance, can be greatly affected by outliers, more affected than, say, a median might be, but arithmetic mean comes with great advantages, namely, easy mathematical manipulation.
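Here is a small sketch of that outlier effect. The heights are made up for illustration, and the 20-foot reading is a deliberately implausible value standing in for an outlier.

```python
import statistics

# Hypothetical heights in feet, for illustration only.
heights = [7.0, 7.5, 8.0, 8.5, 9.0]
with_outlier = heights + [20.0]  # one extreme reading added

# How far does each measure of center move when the outlier is included?
mean_shift = statistics.mean(with_outlier) - statistics.mean(heights)
median_shift = statistics.median(with_outlier) - statistics.median(heights)

print(f"mean moves by {mean_shift:.2f} ft, median moves by {median_shift:.2f} ft")
```

With these numbers, one extreme value drags the mean much further than it drags the median.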

Back to our problem:

You know that you don’t have access to the objective fact you are trying to measure. You can’t measure the true population height, because you can’t measure the whole population. The upshot is, you are stuck with your sample. You only have your sample data to work with.

So how does sampling data behave? How accurately can it reflect its population?

Focusing on how accurately the sample mean can mirror the population mean, we already know a couple of things intuitively:

  1. If your sample is big enough, its mean will be pretty close to the true mean. When it comes to trying to estimate something, more data is better than less data.
  2. We already know that a sample should be random. A non-random sample risks bias. It won’t be representative of its population. For example, in our problem, accidentally oversampling from a relatively short sasquatch family could bias the population mean estimate.
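The first intuition can be checked with a quick simulation. This is only a sketch under stated assumptions: the population is invented (normally distributed, mean 8 feet, standard deviation 0.6 feet), and `typical_error` is a hypothetical helper name, not something from the video.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the run is repeatable

# Assumed population, for the simulation only: normal, mean 8 ft, SD 0.6 ft.
TRUE_MEAN, TRUE_SD = 8.0, 0.6

def typical_error(sample_size, trials=500):
    """Average absolute gap between a random sample's mean and the true mean."""
    gaps = []
    for _ in range(trials):
        sample = [rng.gauss(TRUE_MEAN, TRUE_SD) for _ in range(sample_size)]
        gaps.append(abs(statistics.fmean(sample) - TRUE_MEAN))
    return statistics.fmean(gaps)

errors = {n: typical_error(n) for n in (10, 100, 1000)}
for n, err in errors.items():
    print(f"n = {n:>4}: typical error about {err:.3f} ft")
```

Under these assumptions, the typical gap between the sample mean and the true mean shrinks as the sample gets bigger.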

So sampling can be more or less accurate depending on these issues, but let’s ask a simple question we haven’t talked about yet:

Remember that Woodland Pete’s estimate for the mean height was 8 feet, while the sample data gave a mean height of 7.5 feet.

Doesn’t that alone prove that one of these numbers is wrong?

Well let’s consider the case of flipping a coin 100 times. We’d expect to get 50 heads and 50 tails, but what if we got 52 tails?

This doesn’t prove that the coin is biased. And why is that? It’s because there’s random variation: a fair coin has two possible outcomes (heads and tails), not just one, and that can give rise to sampling error. Because of that, we wouldn’t expect to always, or even usually, get exactly 50 tails out of 100 flips.

But what about 60 or 75 tails? The larger the difference between the observed result and the expected result, the more suspicious we are that it isn’t a fair coin. But to prove bias conclusively, that is, mathematically, we need a precise measure of just how likely 60 or 75 tails is.

Then, if it’s so improbable for a fair coin to give this sample result, we reject the idea that it’s a fair coin after all.
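As a sketch of what that precise measure could look like, the exact chance of seeing at least a given number of tails from a fair coin can be computed with the binomial formula. The function name `prob_at_least` is just an illustrative choice.

```python
from math import comb

def prob_at_least(k, n=100, p=0.5):
    """Exact probability of k or more tails in n flips when P(tails) = p."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, n + 1))

# Chance of each result (or something more extreme) from a fair coin.
tail_probs = {k: prob_at_least(k) for k in (52, 60, 75)}
for k, prob in tail_probs.items():
    print(f"P(at least {k} tails in 100 flips) = {prob:.7f}")
```

For a fair coin, 52 or more tails is quite common, 60 or more happens only a few percent of the time, and 75 or more is vanishingly rare.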

The important thing to note for now is, when you observe a difference between an assumed population mean and a sample mean, that difference alone is not yet enough to say one of them is wrong. You also need to have a measure of the variation in the population data, because this variation, that is, the possibility of multiple outcomes, is what can create inaccuracy, or error, in the sample’s mean.

Let’s pause on this term. What do I mean by error?

Error is the fact that when you’re sampling from a population, the sample’s mean is usually not exactly equal to the population’s true mean.

Consider, if you were sampling from a population with no variation. That is, one that had only one possible outcome. For instance, if all sasquatches are exactly 8 feet tall, the population mean is 8 feet, and then any sample of any size taken from this population could only be exactly 8 feet too. There would be no error in the sample mean.

But as soon as the population varies, as soon as there is more than one possible height for a sasquatch to have, the sample mean might not match up exactly to the population’s true mean. Error is introduced.

Data with variation creates error in the sample mean, and more variation means more error.

Let’s think about this:

Let’s assume the mean height of sasquatches is 8 feet. Looking at a case with little variation: let’s say 95% of sasquatches are between 7.9 and 8.1 feet tall, all very close to the mean of 8 feet. Then, if we obtained a sample mean of 7.5 feet, it’s odd, because this is way outside of the natural variation in heights and seems like a bigger difference than the sampling error could explain alone.

On the other hand, in a case with large variation in heights, say 95% of sasquatches are between 6 and 10 feet tall, the mean might still be 8 feet. Observing a sample mean of 7.5 feet is the same difference from the mean as before, but the greater range of values inherent in the sasquatch population makes sampling a mean of 7.5 unremarkable. In this case, it’s likely sampling error is responsible for the difference.
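These two cases can be compared with a rough simulation. It rests on assumptions the video has not made yet: heights are treated as normally distributed, so "95% within a range" is read as roughly two standard deviations either side of the mean (0.05 feet in the narrow case, 1 foot in the wide case), and samples have 10 members. `share_of_low_means` is a hypothetical helper name.

```python
import random
import statistics

rng = random.Random(1)  # fixed seed so the run is repeatable

def share_of_low_means(pop_sd, cutoff=7.5, pop_mean=8.0, n=10, trials=20000):
    """Fraction of simulated samples of size n whose mean is at or below cutoff."""
    low = 0
    for _ in range(trials):
        sample = [rng.gauss(pop_mean, pop_sd) for _ in range(n)]
        if statistics.fmean(sample) <= cutoff:
            low += 1
    return low / trials

narrow = share_of_low_means(pop_sd=0.05)  # 95% of heights within 7.9 to 8.1 ft
wide = share_of_low_means(pop_sd=1.0)     # 95% of heights within 6 to 10 ft

print(f"narrow population: {narrow:.4f} of samples average 7.5 ft or less")
print(f"wide population:   {wide:.4f} of samples average 7.5 ft or less")
```

Under these assumptions, a sample mean of 7.5 feet essentially never comes out of the narrow population, while the wide population produces one now and then through sampling error alone.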

So now we need to know how to measure the variation in sasquatch heights. That is, what’s the average we should use here to measure the spread of the sasquatch population’s heights around their mean?

Since, again, data with variation creates error in the sample mean, we want to use this measure of data variation to calculate how much error is likely for a sample mean.

This variation, or multiple outcomes in data, that we’ve been talking about has a formal name: variance.

Variance measures the spread of data around its mean.

It’s very important in understanding statistics, so make sure you feel you have a good understanding of this slide before moving on.

Variance is the measure we use to model the spread of data around its mean. It’s the average we use to capture how spread out the data is with one number.

Why is it important? When we can measure data spread we can ultimately infer the amount of error to expect in the means of samples taken from that data.

We saw on the previous slide how, under the same assumed population mean of 8 feet, getting a new sample mean of 7.5 could be either unremarkable or extremely unlikely, all depending on the variance that was inherent in the sasquatch population’s heights.

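So there are actually two formulas for variance, one for population data and one for sample data.

Population Variance: $\sigma^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}$
Sample Variance: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$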

Let’s break down what’s going on in these formulas. First let’s look at what’s the same between the two.

Looking at the numerator, the first thing you need is x̅ ["x-bar"]. That’s the mean of the dataset. So to calculate the numerator, you take every individual data point (that’s i from 1 to n, which means however many datapoints there are), you subtract off the mean to get the difference between that datapoint’s value and the mean value, you square that difference, and then you sum them all up.

Let’s pause here. You might be wondering why variance uses the squared differences between each data point and the data’s mean. Well, foremost, we use it for easier math down the line, but we can also note that it keeps all values positive so that they don’t cancel out. That is, values above the mean that would give a positive difference don’t cancel out with values below the mean, which would give a negative difference if you didn’t square them. Finally, we can note that, for better or for worse, using squared differences to calculate variance has the consequence that outliers are weighted more heavily: a point twice as far from the mean contributes four times as much.

The second thing to note about variance is the averaging. Because the numerator first sums up all of these squared differences and then divides by n or n-1, we are getting an average. So, variance is really the average squared difference the data displays around its mean.
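The two steps, summing the squared differences and then averaging, can be sketched in a few lines of Python. The data values are hypothetical, and the function names are illustrative; the results are cross-checked against the standard library’s `statistics.pvariance` (divides by n) and `statistics.variance` (divides by n-1).

```python
import statistics

def sum_of_squared_differences(data):
    """Numerator shared by both formulas: squared gaps from the mean, summed."""
    mean = sum(data) / len(data)
    return sum((x - mean) ** 2 for x in data)

def population_variance(data):
    """Average squared difference, dividing by n."""
    return sum_of_squared_differences(data) / len(data)

def sample_variance(data):
    """Average squared difference, dividing by n - 1."""
    return sum_of_squared_differences(data) / (len(data) - 1)

# Hypothetical values, for illustration only (not data from the video).
data = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]

print(population_variance(data), statistics.pvariance(data))
print(sample_variance(data), statistics.variance(data))
```

The only difference between the two functions is the denominator, so the n-1 version always comes out a little larger for the same data.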

Now, what makes these two formulas different?

The first thing to note is the different symbols. Population variance is notated by σ² ["sigma squared"] and sample variance by s² ["s squared"]. They are both squared because they are the average squared difference of the data around its mean. Because we always need to be careful if we are talking about a population or a sample, that is, we want to state a fact about a whole or state a fact about a partial bit of it, we use these different symbols to keep those two ideas separate.

The second difference to note is the different denominators. Population variance finds its average by dividing by n, like you would expect, but sample variance divides by n-1.

Why?

Remember that in our example, we can’t calculate the sasquatch population’s mean height because we couldn’t measure every sasquatch. The population variance is similarly inaccessible and unmeasurable. We can’t measure every member. So when we can’t know the population’s variance, the best we can do is to try to estimate it from what’s observed in the sample. For mathematical reasons, using n-1, called Bessel’s correction, will give a better estimate of the population's variance when estimating it from a sample.
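One way to see the effect of dividing by n-1 is a simulation, sketched here under invented assumptions: a normal population whose true variance is known to be 1, and many small samples of 5 drawn from it. For each sample we compute the variance both ways and then average the results.

```python
import random
import statistics

rng = random.Random(2)  # fixed seed so the run is repeatable

# Assumed setup, for the simulation only: true variance 1, samples of size 5.
TRUE_VARIANCE = 1.0
N, TRIALS = 5, 20000

divide_by_n, divide_by_n_minus_1 = [], []
for _ in range(TRIALS):
    sample = [rng.gauss(0.0, TRUE_VARIANCE ** 0.5) for _ in range(N)]
    mean = sum(sample) / N
    squared_diffs = sum((x - mean) ** 2 for x in sample)
    divide_by_n.append(squared_diffs / N)
    divide_by_n_minus_1.append(squared_diffs / (N - 1))

avg_n = statistics.fmean(divide_by_n)
avg_n_minus_1 = statistics.fmean(divide_by_n_minus_1)

print(f"dividing by n:     average estimate {avg_n:.3f}")
print(f"dividing by n - 1: average estimate {avg_n_minus_1:.3f}")
```

In this setup, dividing by n underestimates the true variance on average, while dividing by n-1 lands close to it.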

Finally, as a note for the mathematically minded out there, as you go further in statistics, note that you will eventually see more consequences from 1.) The use of s² as an estimate of σ², especially at low n or sample size, and 2.) The fact that variance itself is an estimate calculated from another estimate, that is, the mean, which is an estimate of the population’s center.

You might have already heard of standard deviation, which is another measure of the spread of data around its mean. Standard Deviation is in “unsquared” units, that is, on the same scale as the data itself.

Standard Deviation is simply the square root of variance. Again keeping population and sample separate, here are the formulas:

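Population Standard Deviation: $\sigma = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n}}$
Sample Standard Deviation: $s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}}$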

Because it’s on the same scale as the data, we often use standard deviation when we want a number to summarize the spread of the dataset. But you can see how easy it is to go back and forth between standard deviation and variance just by squaring and unsquaring.

Mathematically, one important thing to note is that if you need to combine standard deviations, such as adding two of them together, you’ll need to go through variance. Assuming the two quantities are independent, you square the standard deviations into variances, sum them, then take the square root of that sum.
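As a small sketch of going through variance, here is the calculation for two standard deviations. It assumes the two quantities are independent, and both the values 3 and 4 and the name `combined_sd` are purely illustrative.

```python
import math

def combined_sd(sd_a, sd_b):
    """SD of the sum of two independent quantities: square, add, take the root."""
    return math.sqrt(sd_a**2 + sd_b**2)

combined = combined_sd(3.0, 4.0)
print(combined)  # variances 9 + 16 = 25, and the square root of 25 is 5.0
```

Note that the result, 5, is smaller than what you would get by simply adding 3 and 4.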

Now that we understand variance and standard deviation a little more, let’s revisit our sample data.

In the first column we have the original 10 observations, and remeasuring the center of our sample, we get a sample mean, x̅, of 7.54 feet. Now, let’s measure the variance in our sample.

Here’s the formula for sample variance again: $s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n - 1}$

In the second column we have each individual datapoint’s squared difference from the mean. For example, the first observation, 7.6 feet, minus the sample mean, x̅, 7.54 feet, squared, gives you 0.0036.

The average of all of these squared differences, the n-1 average (because this is a sample variance we are calculating), gives you 0.3849 feet squared (The sample variance being a squared figure).

From here it’s easy to calculate sample standard deviation just by taking the square root of the variance, which gives 0.6204 feet.
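The arithmetic quoted in this walkthrough can be re-checked in a few lines. Only the figures stated in the transcript are used (10 observations, a first observation of 7.6 feet, a mean of 7.54 feet, and a variance of 0.3849); the full table of observations is not reproduced here, so the sum of squared differences below is back-calculated from the rounded variance and is approximate.

```python
import math

# Figures quoted in the transcript.
n = 10
sample_mean = 7.54          # feet
first_observation = 7.6     # feet
sample_variance = 0.3849    # feet squared

# Squared difference for the first observation.
first_squared_difference = (first_observation - sample_mean) ** 2

# Implied total of the second column: variance times (n - 1). Approximate,
# because it is back-calculated from the rounded variance.
sum_of_squares = sample_variance * (n - 1)

# Standard deviation is the square root of the variance.
sample_sd = math.sqrt(sample_variance)

print(round(first_squared_difference, 4), round(sum_of_squares, 4), round(sample_sd, 4))
```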

Now how do we interpret this number? Roughly speaking, the sample standard deviation is the typical distance, in feet, between a sasquatch’s height and the mean height, according to this sample.

Now let’s summarize all of our population and sample data. Here’s what we know about the population, which is actually nothing, but we are deferring to Woodland Pete’s authority and assuming, until proven otherwise, that the true population mean is 8 feet even, like he said. Note that the population mean is denoted by the greek symbol μ [“mu”].

Here are all the measures of our sample. We have the mean, variance, standard deviation, and count. Note that standard deviation and variance are taken out to so many decimals because if we are to use this in an equation somewhere any intermediate rounding we do now could throw off the final answers calculated from these.

Here’s how you can quickly calculate sample statistics like these on the statmagic app. In the Descriptive Stats calculator, make sure the toggle is on s mode to calculate stats for sample data. σ [“sigma”] mode is for those rare cases where you have complete population data.

Then, enter the 10 data points, choose your desired final answer rounding, four is fine for this, and hit the calculate button. You can then browse the output pages or lower the keypad to see all results at once.

So here are the key points from this first part of this introduction to statistics:

When we cannot measure a population completely, like we can’t measure the whole population of sasquatches, we can then only measure it using a random sample. But sample data can only estimate the population’s true parameters: we can only get estimates of the true mean and the true variance and standard deviation within the population.

So the fact that we have to sample has an important consequence when it comes to estimating the true mean of a population:

  1. When a population has variance, this introduces error, or inaccuracy, into the means of the samples taken from it.
  2. Therefore, if you see a difference between an assumed population mean and a sample mean, like we did in our example, this could just be due to the error in sample estimation. So this difference, in itself, does not prove the original assumption wrong, unless...
  3. The magnitude of this difference is larger than the error that is likely explained by the variance in the population data alone.

From here, we want to know, how can we calculate the expected error that we get when sampling (that is, when using a sample mean to estimate the true mean)?

In other words, how can we calculate an expected range of means we are likely to see in a random sample given that the population we are sampling from has variance? We want to know so that we can tell if a mean is or is not likely due to just sampling error.

To estimate this likely range of means we will first decide how to model our population. At this point, we turn from measuring data towards making more data modeling decisions. Just like we had to decide which average we should use to model the central tendency of sasquatch heights, we will now decide how to model the population as a whole. We’ll use A. whatever is already known or assumed about the population, plus B. statistics gained from our random sample which we will use to fill in any missing population assumptions.

So what model should we use to model the heights of sasquatches? Check out Part 2 to see.