Introduction to Statistics

Part 2: Modeling Data

In this video, we explain:



Welcome to part 2 on modeling data. Here we go.

So what do we mean by modeling?

We mean to adopt an idealized mathematical distribution that we can work with, as opposed to keeping our data as discrete data points which we can’t work with as easily.

And why model data?

Using mathematical models gives us precision and rigor. But using a model means adopting its assumptions, and different models require different assumptions. Most importantly, a model should fit its data well. Because if the data doesn’t fit, the math of the model will still look nice and precise, but the model won’t be giving accurate results. That is, it won’t be giving results that reflect the truth, and the truth is what we’re really after.

As a quick refresher, here’s the problem again:

The problem is that the latest sample of 10 sasquatch heights doesn’t match up with the accepted, authoritative claim regarding the true mean height.

In this video, we want to explore how we should model the sasquatch population’s heights using a couple of assumptions. This is an important step toward answering what the true mean height actually is.

And here again are our population and sample data summaries. From these, we need two parameters to build the model for our data: the population’s central tendency (mean height) and its spread (the standard deviation of the heights).

First, with the mean, we’re deferring to Woodland Pete’s authority and assuming, until proven otherwise, that the true population mean μ [“mu”] is 8 foot even, like he said.

In fact, as we’ll see, we need to build the population model with his figure in order to prove it otherwise.

Second, with the standard deviation: because we can’t know the true standard deviation, we’re estimating it from our sample. There’s a subtle but important technical point here: s is NOT, strictly speaking, the standard deviation of the sample. Remember that the sample variance and s use the special n-1 average because they are trying to be estimates of the true population variance and standard deviation. Perhaps s would be better called something like the sample-estimated standard deviation, but we’ll continue to call it simply the sample standard deviation, since that’s what it’s commonly termed.
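If you want to see this distinction in action, here’s a quick sketch in Python. Note the heights below are made-up placeholder values, not our actual sasquatch sample: `pstdev` divides by n (the spread of exactly these points), while `stdev` divides by n-1, giving s.

```python
import statistics

# Hypothetical placeholder heights (NOT the actual sasquatch sample)
heights = [7.1, 6.9, 8.3, 7.6, 7.0, 7.8, 8.5, 7.2, 7.9, 7.1]

# pstdev divides by n: the spread of exactly these ten points.
# stdev divides by n - 1: the estimate s of the population's spread.
print(statistics.pstdev(heights))  # always a bit smaller...
print(statistics.stdev(heights))   # ...than s, which corrects for the bias
```

The n-1 version is always slightly larger, compensating for the fact that a sample tends to understate its population’s true spread.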

So how to choose a model?

To model our population, we want to choose the most accurate distribution. I’ve said a model is a mathematical distribution, so what is that?

You can think of a distribution as a map that tells us how frequently any value of x occurs within a population. Mathematically, a distribution is a function that can be graphed: give it, for example, a height, and it will spit out what percentage of the population has exactly that height, in other words, how frequently that height occurs within the distribution. Or you can give it a range, and by integrating the function (or using an app to do that), you can find the area under the curve, which represents the percentage of the population whose height falls within that range.

Note that when we model how frequently x occurs within a population this is also the probability that one individual selected at random from that population will have that value x, at least according to the model.

Don’t worry if that’s not perfectly clear yet. We’ll spend some more time on this idea coming up.

How do we go about choosing a good model? That is, a distribution we think accurately represents our population?

Normally, we should plot our sample data, and look for a pattern, but our sasquatch sample is too small to give a good indication. Instead, let’s choose from a distribution lineup:

All of these models map values of x (in our case height values) against their frequency, or how often they show up in the population. In this first distribution, we have a population with no variance, where every sasquatch is exactly 8 feet tall. The height, x, doesn’t vary, and the frequency doesn’t vary. 100% of the population is 8 feet tall. I think our sample rules out this one.

In the second, we have a uniform distribution, where every height in the range of sasquatch heights has an equal frequency. There are just as many short bigfoots as there are tall ones as there are average-statured ones. Maybe.

Next we have an exponential distribution. Here there are very many short sasquatches and fewer and fewer tall ones. This could explain why there are so few sightings.

After that, we have what’s called a normal distribution, where most sasquatches are around the average height, with fewer and fewer at the short and tall extremes. This is how human heights are, of course.

Finally we have a bimodal distribution. Maybe sasquatch heights tend to cluster around two main heights, maybe males and females have two strong but unequal central tendencies.

Let’s go with the idea that sasquatch heights are approximately normally distributed, like human heights and many other things in nature: the weights of apples, the diameters of trees, the volumetric flow of rivers, and so on. We’re going to assume most sasquatches are about average height, with fewer and fewer having more extreme heights.

So let’s take a look at the function itself.

This function uses Euler’s number to produce the familiar bell-shaped normal curve from two parameters: mean and standard deviation.

Yes, it looks complex, but for introductory stats we won’t need to get into the math too deeply, much less get into calculus to integrate it. I’ve just shared the formula here so that you know where the normal curve gets its shape from.
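For reference, the standard textbook form of the normal probability density function shown on screen is:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}
```

Here e is Euler’s number, and the two parameters μ and σ shift and stretch the bell shape.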

One important thing to note here is that the normal distribution does not own standard deviation; it just takes it as an input to map x onto this bell-shaped frequency. Standard deviation can be calculated for any dataset, and other distributions, like the uniform or the exponential, simply map x onto probability differently.

Taking our assumed μ of 8 feet and the standard deviation estimated by the sample, 0.62 feet, here’s an annotated normal curve, depicting our normal model of sasquatch heights:

There’s a way we can simplify the normal model for ourselves, and that’s by using the standard normal distribution.

The SND, as I’m going to call it for my own sake, simplifies things by having a center μ [“mu”] of 0 and a standard deviation σ [“sigma”] of 1. When the function’s inputs take these values, the formula simplifies down to taking just one input, Z. Z, then, is how many units of standard deviation an x is away from the mean.

Let me explain: because the standard deviation of the SND is 1 unit, converting x into units of standard deviation will tell us where we are on the standard curve. This is in contrast to needing μ and σ to have the general normal function tell us where we are on our specific normal curve.

We can save a lot of calculation trouble just by doing this conversion first and then making use of the SND. This conversion of an x value into sigma units is known as its Z-score.

How do you calculate a Z-score?

Because a Z-score tracks how many standard deviations a value x is away from the mean, you simply find the difference between that value x and μ and then divide by σ.

For example, finding where the 9-foot sasquatches fall on our normal population model, which has a center of 8 and a standard deviation of 0.62, gives us a Z-score of 1.61, that is: (9 - 8)/0.62 ≈ 1.61.

This tells you that where 9 falls on our custom model, is exactly equivalent to where 1.61 falls on the standard normal model (SND).
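In code, the conversion is one line. Here’s a minimal sketch (the `z_score` helper is just illustrative, not a function from the video):

```python
def z_score(x, mu, sigma):
    """How many standard deviations x lies from the mean."""
    return (x - mu) / sigma

# Where do 9-foot sasquatches fall on the model with center 8, spread 0.62?
z = z_score(9, mu=8, sigma=0.62)
print(round(z, 2))  # 1.61
```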

This is the benefit then of using the standard normal curve, that every Z-score has a fixed spot on it. And so, because the standard “map” doesn’t change, every Z-score has a fixed proportion, or percentage of the population that falls above and below that score.

This is where the Empirical Rule comes from. The Empirical Rule tells you that when values are converted into sigma units, that is Z-scores, approximately 68% of a normal population is within one σ of the mean, 95% within 2 σs, and 99.7% within 3.

Of course these are just the major landmarks on the SND. We can also find the proportion and probability of a normal population falling above or below any fractional Z-score or between any two Z-scores.

Notice the leap here: a proportion of the population is also a probability, because a random individual from that population has that chance of coming from that proportion of it. For example, an individual selected at random from a normal population will have a 68% chance of being within 1 σ of the mean. Doing some examples will make this all more clear.
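You can verify the Empirical Rule’s landmarks with nothing but Python’s standard library: for a standard normal, P(|Z| < k) works out to erf(k/√2). A quick sketch:

```python
from math import erf, sqrt

# Proportion of a normal population within k sigmas of the mean:
# P(|Z| < k) = erf(k / sqrt(2))
for k in (1, 2, 3):
    print(k, round(erf(k / sqrt(2)), 4))
# 1 0.6827
# 2 0.9545
# 3 0.9973
```

Those three proportions are the 68 / 95 / 99.7 landmarks, computed exactly rather than memorized.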

To find what are called Z-normal proportions and probabilities, maybe your stat class has had you look up probabilities from a Z-table in the back of your textbook. However, we can more easily use the statmagic app to explore any probability for any normal model.

I’ll show you a quick demo, but first, note that if you are interested in gaining a deeper understanding of why we can always use Z-scores on the standard normal distribution as a substitute for any arbitrary normal distribution, I’ll share that math at the end of this video.

And because I like any excuse for a nice visual, here’s the Empirical Rule illustrated on the standard normal distribution:

Now that we know how to model a normal population, we can use the statmagic app to explore Z-scores and Z-normal probabilities really quickly.

In statmagic’s normal distribution calculator: First make sure you are on x-mode. X-mode is for finding probabilities and proportions that a single random individual from the population has some value x, or in our case, the probability that a single sasquatch has an x height value.

So, enter the population’s assumed mean. μ was 8 feet.

And then the standard deviation. σ was 0.62.

Now you can find probabilities for any range of x, that is height, on the modeled population normal curve.

For example, if you want to see what the normal model says is the probability that the next random sasquatch sighted will have a height between 6’9” and 9’3”, enter 6.75 feet for the lower bound and 9.25 feet for the upper bound, and then hit calc.

The probability that a height falls between those values is 0.9562.

See how this range of heights, with Z-scores a little more than ±2, covers a little over 95% of the sasquatch population? Just like the Empirical Rule would suggest.

If you want to see the probability that the next sasquatch sighted has a height greater than, say, 9 feet, just change the toggle to greater than, change the first bound to 9, and hit calc.

Note that the Z-score for 9-foot sasquatches is 1.6129, and the proportion of sasquatches over 9 feet is just over 5%.

You can also prove with this calculator that using the standard normal distribution and Z-scores works. Change the curve parameters to those of the standard normal (a μ of 0 and a σ of 1) and then instead of 9, use the Z-score for 9 foot sasquatches (1.6129), hit calc, and notice that it’s the same probability.
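If you’d rather check these numbers outside the app, the same probabilities can be computed with the standard library’s erf. The `normal_cdf` helper below is an illustrative sketch, not part of statmagic:

```python
from math import erf, sqrt

def normal_cdf(x, mu, sigma):
    """P(X <= x) for a Normal(mu, sigma) model."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))

mu, sigma = 8.0, 0.62  # assumed mean, sample-estimated standard deviation

# P(6.75 < X < 9.25): the 6'9" to 9'3" range
p_range = normal_cdf(9.25, mu, sigma) - normal_cdf(6.75, mu, sigma)
print(round(p_range, 4))  # 0.9562

# P(X > 9), directly and via the Z-score on the standard normal
z = (9 - mu) / sigma
print(round(z, 4))  # 1.6129
print(round(1 - normal_cdf(9, mu, sigma), 4))  # 0.0534
print(round(1 - normal_cdf(z, 0, 1), 4))       # 0.0534 -- same probability
```

The last two lines are the point of this whole section: the custom curve with x, and the standard curve with Z, give the same answer.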

And that’s a wrap on this part on modeling data.

Let’s review the key points:

We learned in Part 1 that when we can’t measure a population directly or feasibly, like we can’t measure the sasquatch population, we have to collect data from a random sample, which we can then use to estimate any missing population parameters.

In this part, we learned about normal distributions, and how standardizing, that is using Z-scores, simplifies the task of using normal models to find population proportions and probabilities.

Now that we have a model, we can start looking at how to use that model to infer what’s true about the sasquatch population. To infer what the true mean height really is. Because remember, we’ve taken Woodland Pete’s number and built it into our model. We’ve assumed that the true mean sasquatch height is 8 feet, at least until we can prove it otherwise.

Of course, the sample of 10 sasquatches estimated a different mean height: 7.54 feet.

So, as we’ve asked before, how can we tell if that sample mean is signalling to us that our model’s assumed mean is wrong? Or could that 7.54 still be within the likely sampling error you get whenever you sample from a population with variance? (In our case, a population with center 8 and standard deviation 0.62.)

In Part 3 we’ll explore what’s called the Central Limit Theorem, and how it allows us to infer the range of means that is likely for a random sample taken from a population with variance.

That wraps up this part, but you can hang on if you are interested in seeing some more of the math of why it works to convert to the standard normal distribution.

In this bonus section I’m going to show you the math that lets us substitute the standard normal curve and its Z-scores for any other normal curve.

We’ve seen this boxed function before. This is the one that maps, or draws the normal curve. That is, it takes x values and mean and standard deviation as inputs and outputs a normal frequency.

This function is technically known as the normal probability density function (PDF).

But, if we want to use Z-scores instead, we can do a simple substitution for x:

So here we have the equation for the Z-score, which can be rearranged to solve for x. That gives us x in terms of σ, Z, and μ, which we can then substitute into the PDF.

And this all simplifies down to this:

Notice that there’s still a σ here in the denominator. This is the general form of the PDF that takes the Z-score instead of x, and it still needs to be scaled by σ. But if we restrict ourselves to using Z-scores only on the standard normal distribution, we know σ equals 1, and therefore σ can be taken out of the equation:
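Written out (standard textbook algebra, shown here for reference): substituting x = σz + μ, the exponent collapses, and then setting σ = 1 removes the remaining scale factor:

```latex
f(x) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-z^2/2},
\qquad \text{and with } \sigma = 1: \qquad
\varphi(z) = \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}
```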

This standardized form of the normal probability density function gets its own special notation with the Greek letter φ ["phi"].

The standard normal probability density function can be used to calculate the exact frequency of any x using that x’s Z-score, but in statistics asking what percentage or frequency of sasquatches are exactly 9 feet tall is a much less common question than: How many sasquatches are at least 9 feet tall?

This second kind of question, involving a range of values, requires summing up the PDF over every possible x, 9 feet and greater, at infinitesimally small intervals, which is, of course, a calculus move.

So finding this area under the curve then requires an integral of the PDF. This integral is known as the normal distribution’s cumulative distribution function, the normal CDF:

With the CDF we can simplify things too, again using the standard normal curve as a substitute for integrating any arbitrary normal curve.

Here’s the math to show it.

Again the boxed equation is the integral of the normal PDF over the range x1 to x2.

We can start things off again by rearranging the Z-score equation and substituting for x. This is just like what we did with the PDF, but now note the bounds have been converted, and of course, with an integral, a dx pops up.

So, knowing x in terms of the Z-score we can derive dx, which is σ dz.

So, in our effort to reduce this to a single variable Z, we can substitute σ dz for dx. Notice the bounds are just Z-scores too. And this finally resolves to this:

Where the σ in front of the dz cancels out the σ in the denominator, and now you have an integral taking only Z-scores.
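For reference, here’s the whole chain written out in standard textbook form. With x = σz + μ, dx = σ dz, and the bounds converted to Z-scores, z₁ = (x₁ − μ)/σ and z₂ = (x₂ − μ)/σ:

```latex
\int_{x_1}^{x_2} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}\, dx
= \int_{z_1}^{z_2} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-z^2/2}\, \sigma\, dz
= \int_{z_1}^{z_2} \frac{1}{\sqrt{2\pi}}\, e^{-z^2/2}\, dz
```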