1.4 Density Curves and Normal Distributions

When you complete this section, you will be able to:

Sketch a Normal distribution for any given mean and standard deviation.
Apply the 68–95–99.7 rule to find proportions of observations within one, two, and three standard deviations of the mean for any Normal distribution.
Find the z-score for any observation x.
Compute areas under a Normal curve using software or Table A.
Perform inverse Normal calculations to find values of a Normal variable corresponding to various areas.
Assess the extent to which the distribution of a set of data can be approximated by a Normal distribution.

We now have a kit of graphical and numerical tools for describing distributions. What is more, we have a clear strategy for exploring data on a single quantitative variable:

Always plot your data: make a graph, usually a stemplot or a histogram.
Look for the overall pattern and for striking deviations such as outliers.
Calculate an appropriate numerical summary to briefly describe center and spread.

Technology has expanded the set of graphs that we can choose for Step 1. It is possible, though painful, to make histograms by hand. Using software, clever algorithms can describe a distribution in a way that is not feasible by hand, by fitting a smooth curve to the data in addition to or instead of a histogram. The curves used are called density curves. Before we examine density curves in detail, here is an example of what software can do.

Example 1.32 Density curve for times to start a business.

Data set icon for tts186.

Figure 1.19 illustrates the use of a density curve along with a histogram to describe distributions. It shows the distribution of the times to start a business for 186 countries (see Example 1.19, page 26). The outlier, Venezuela, described in Check-in question 1.16 (page 27), has been deleted from the data set. The distribution is highly skewed to the right. Most of the data are in the first several classes, with 50 or fewer days to start a business, but there are a few countries with very large start times.

A histogram with a density curve plots count versus time to start in days, with a class size of 10. — Figure 1.19 The distribution of 186 times to start a business, Example 1.32. Venezuela, the outlier, has been eliminated from this plot. The distribution is pictured with both a histogram and a density curve. This distribution has a single mode with a long tail.

A smooth density curve is an idealization that gives the overall pattern of the data but ignores minor irregularities. We first discuss density curves in general and then focus on a special class of density curves, the bell-shaped Normal curves.

Density curves

One way to think of a density curve is as a smooth approximation to the irregular bars of a histogram. Figure 1.20 shows a histogram of the scores of all 947 seventh-grade students in Gary, Indiana, on the vocabulary part of the Iowa Test of Basic Skills. Scores of many students on this national test have a very regular distribution. The histogram is symmetric, and both tails fall off quite smoothly from a single center peak. There are no large gaps or obvious outliers. The curve drawn through the tops of the histogram bars in Figure 1.20 is a good description of the overall pattern of the data.

Two histograms of grade equivalent vocabulary scores. — Figure 1.20 (a) The distribution of Iowa Test vocabulary scores for Gary, Indiana, seventh-graders, Example 1.33. The shaded bars in the histogram represent scores less than or equal to 6.0. (b) The shaded area under the Normal density curve also represents scores less than or equal to 6.0. This area is 0.293, close to the true 0.303 for the actual data.

Example 1.33 Vocabulary scores.

In a histogram, the heights of the bars represent either counts or proportions of the observations. In Figure 1.20(a), we shaded the bars that represent students with vocabulary scores 6.0 or lower. There are 287 such students, who make up the proportion 287 / 947 = 0.303 of all Gary seventh-graders. The shaded bars in Figure 1.20(a) make up proportion 0.303 of the total area under all the bars. If we adjust the scale so that the total area of the bars is 1, the area of the shaded bars will also be 0.303.

In Figure 1.20(b), we shaded the area under the curve to the left of 6.0. If we adjust the scale so that the total area under the curve is exactly 1, areas under the curve will then represent proportions of the observations. That is, area = proportion. The curve is then a density curve. The shaded area under the density curve in Figure 1.20(b) represents the proportion of students with score 6.0 or lower. This area is 0.293, only 0.010 away from the histogram result. You can see that areas under the density curve give quite good approximations of areas given by the histogram.

The density curve in Figure 1.20 is a Normal curve. Density curves, like distributions, come in many shapes. Figure 1.21 shows two density curves: a symmetric Normal density curve and a right-skewed curve.

Two density curves, one normal and one right skewed. — Figure 1.21 (a) A symmetric Normal density curve with its mean and median marked. (b) A right-skewed density curve with its mean and median marked.

We will discuss Normal density curves in detail in this section because of the important role they play in statistics. There are, however, many applications where the use of other families of density curves is essential.

A density curve of an appropriate shape is often an adequate description of the overall pattern of a distribution. Outliers, which are deviations from the overall pattern, are not described by the curve.

Measuring center and spread for density curves

Our measures of center and spread apply to density curves as well as to actual sets of observations, but only some of these measures are easily seen from the curve. A mode of a distribution described by a density curve is a peak point of the curve, the location where the curve is highest. Because areas under a density curve represent proportions of the observations, the median is the point with half the total area on each side. You can roughly locate the quartiles by dividing the area under the curve into quarters as accurately as possible by eye. The IQR is the distance between the first and third quartiles. There are mathematical ways of calculating areas under curves. These allow us to locate the median and quartiles exactly on any density curve.

What about the mean and standard deviation? The mean of a set of observations is their arithmetic average. If we think of the observations as weights strung out along a thin rod, the mean is the point at which the rod would balance. This fact is also true of density curves. The mean is the point at which the curve would balance if it were made out of solid material. Figure 1.22 illustrates this interpretation of the mean.

Three right skewed density curves and their horizontal axes are balanced on the top vertex of a triangle. When the vertex is left or right of the mean, the curve falls in the direction of the mean. When the vertex is at the mean, the curve is balanced. — Figure 1.22 The mean of a density curve is the point at which it would balance.

A symmetric curve, such as the Normal curve in Figure 1.21(a), balances at its center of symmetry. Half the area under a symmetric curve lies on either side of its center, so this is also the median.

For a right-skewed curve, such as those shown in Figures 1.21(b) and 1.22, the small area in the long right tail tips the curve more than the same area near the center. The mean (the balance point), therefore, lies to the right of the median. It is hard to locate the balance point by eye on a skewed curve. There are mathematical ways of calculating the mean for any density curve, so we are able to mark the mean as well as the median in Figure 1.21(b). The standard deviation can also be calculated mathematically, but it can’t be located by eye on most density curves.
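A small numerical sketch can make these "mathematical ways" concrete. It assumes a hypothetical right-skewed density, f(x) = e^(−x) for x ≥ 0, chosen only for illustration (it is not one of the curves in the figures), and approximates areas with a simple midpoint rule. The median is the point with half the area to its left; the mean is the balance point, the area under x · f(x).

```python
import math

# Hypothetical right-skewed density, chosen only for illustration:
# f(x) = exp(-x) for x >= 0. Its total area is 1.
def density(x):
    return math.exp(-x)

def area(f, a, b, steps=20_000):
    """Approximate the area under f between a and b (midpoint rule)."""
    width = (b - a) / steps
    return sum(f(a + (i + 0.5) * width) for i in range(steps)) * width

UPPER = 40.0  # the area beyond this point is negligible for this curve

def median_of(f, lo=0.0, hi=UPPER):
    """Bisection search for the point with half the total area to its left."""
    for _ in range(40):
        mid = (lo + hi) / 2
        if area(f, 0.0, mid, steps=2000) < 0.5:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

total_area = area(density, 0.0, UPPER)              # should be very close to 1
median = median_of(density)                         # equal-areas point
mean = area(lambda x: x * density(x), 0.0, UPPER)   # balance point
```

For this curve the mean comes out larger than the median, as the balance-point argument predicts for a right-skewed density.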

A density curve is an idealized description of a distribution of data. For example, the density curve in Figure 1.20 (page 47) is exactly symmetric, but the histogram of vocabulary scores is only approximately symmetric. We therefore need to distinguish between the mean and standard deviation of the density curve and the numbers x̄ and s computed from the actual observations. The usual notation for the mean of an idealized distribution is μ (the Greek letter mu). We write the standard deviation of a density curve as σ (the Greek letter sigma). In Chapter 5, we refer to x̄ and s as statistics associated with a sample and to μ and σ as parameters associated with a population.

Normal distributions

One particularly important class of density curves has already appeared in Figures 1.20 and 1.21(a). These density curves are symmetric, unimodal, and bell-shaped. They are called Normal curves, and they describe Normal distributions. All Normal distributions have the same overall shape.

The exact density curve for a particular Normal distribution is specified by giving the distribution’s mean μ and its standard deviation σ . The mean is located at the center of the symmetric curve and is the same as the median. Changing μ without changing σ moves the Normal curve along the horizontal axis without changing its spread.

The standard deviation σ controls the spread of a Normal curve. Figure 1.23 shows two Normal curves with different values of σ . The curve with the larger standard deviation is more spread out.

Two normal curves. Both have a mean labeled mu and a deviation marked to the right of mu labeled sigma. The first curve is shorter and wider, with a larger deviation, sigma sub 1. The second is taller and narrower with a smaller deviation, sigma sub 2. — Figure 1.23 Two Normal curves, both showing the same mean μ but with differing standard deviations σ 1 and σ 2 .

The standard deviation σ is the natural measure of spread for Normal distributions. Not only do μ and σ completely determine the shape of a Normal curve, but we can locate σ by eye on the curve. Here’s how. As we move out in either direction from the center μ, the curve changes from falling ever more steeply to falling ever less steeply.

A normal distribution curve. Standard deviations on either side of the mean are represented by gaps in the curve, showing the points where the curve changes from falling more steeply to less steeply.

The points at which this change of curvature takes place are located at distance σ on either side of the mean μ. You can feel the change as you run your finger along a Normal curve and so find the standard deviation. Caution: Remember that μ and σ alone do not specify the shape of most distributions and that the shape of density curves in general does not reveal σ. These are special properties of Normal distributions.

There are other symmetric bell-shaped density curves that are not Normal. The Normal density curves are specified by a particular equation. The height of the density curve at any point x is given by

$$
\frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{1}{2}\left(\frac{x-\mu}{\sigma}\right)^{2}}
$$

We will not make direct use of this fact, although it is the basis of mathematical work with Normal distributions. Notice that the equation of the curve is completely determined by the mean μ and the standard deviation σ .
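As a rough illustration, the height formula can be written as a short function. The names `normal_density` and `curvature` are ours, and the values μ = 0 and σ = 1 are chosen only for convenience. The second function estimates how the curve bends, so we can check that the bend changes direction at distance σ from the mean.

```python
import math

def normal_density(x, mu, sigma):
    """Height of the Normal curve with mean mu and standard deviation sigma at x."""
    z = (x - mu) / sigma
    return math.exp(-0.5 * z * z) / (sigma * math.sqrt(2 * math.pi))

def curvature(x, mu, sigma, h=1e-4):
    """Finite-difference estimate of the second derivative of the density at x."""
    f = normal_density
    return (f(x + h, mu, sigma) - 2 * f(x, mu, sigma) + f(x - h, mu, sigma)) / h**2

mu, sigma = 0.0, 1.0  # illustrative values, not taken from the text

peak_height = normal_density(mu, mu, sigma)          # the curve is highest at the mean
bend_inside = curvature(mu + 0.9 * sigma, mu, sigma)   # negative: falling ever more steeply
bend_outside = curvature(mu + 1.1 * sigma, mu, sigma)  # positive: falling ever less steeply
```

The sign of the estimated curvature flips between 0.9σ and 1.1σ from the mean, which is the change you can feel when you run a finger along the curve.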

Why are the Normal distributions important in statistics? Here are three reasons:

Normal distributions are good descriptions for some distributions of real data. Distributions that are often close to Normal include scores on tests taken by many people (such as the Iowa Test of Figure 1.20, page 47), repeated careful measurements of the same quantity, and characteristics of biological populations (such as lengths of baby pythons and yields of corn).
Normal distributions are good approximations to the results of many kinds of chance outcomes, such as tossing a coin many times.
Many statistical inference procedures based on Normal distributions work well for other roughly symmetric distributions.

Caution: However, even though many sets of data follow a Normal distribution, many do not. Most income distributions, for example, are skewed to the right and so are not Normal. Non-Normal data, like nonnormal people, not only are common but are also sometimes more interesting than their Normal counterparts.

The 68–95–99.7 rule

Although there are many Normal curves, they all have common properties. Here is one of the most important.

The 68–95–99.7 rule

In the Normal distribution with mean μ and standard deviation σ :

Approximately 68% of the observations fall within σ of the mean μ.
Approximately 95% of the observations fall within 2σ of μ.
Approximately 99.7% of the observations fall within 3σ of μ.

Figure 1.24 illustrates the 68–95–99.7 rule. By remembering these three numbers, you can think about Normal distributions without constantly making detailed calculations.
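The three numbers in the rule are rounded. A minimal sketch, assuming Python's standard-library `statistics.NormalDist` as the software, shows the more precise proportions. The helper name `proportion_within` is ours.

```python
from statistics import NormalDist

def proportion_within(k, mu=0.0, sigma=1.0):
    """Proportion of a Normal distribution within k standard deviations of the mean."""
    dist = NormalDist(mu, sigma)
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)

# Proportions within 1, 2, and 3 standard deviations, rounded to four places.
within = {k: round(proportion_within(k), 4) for k in (1, 2, 3)}
# within == {1: 0.6827, 2: 0.9545, 3: 0.9973}

# The answer does not depend on which Normal distribution we use.
same_for_heights = round(proportion_within(2, mu=64.5, sigma=2.5), 4)
```

The rule's 68, 95, and 99.7 are these values rounded for easy recall.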

A normal distribution curve with percent of observations labeled. — Figure 1.24 The 68–95–99.7 rule for Normal distributions.

Example 1.34 Heights of young women.

The distribution of heights of young women aged 18 to 24 is approximately Normal with mean μ = 64.5 inches and standard deviation σ = 2.5 inches. Figure 1.25 shows what the 68–95–99.7 rule says about this distribution.

Two standard deviations equals five inches for this distribution. The 95 part of the 68–95–99.7 rule says that the middle 95% of young women are between 64.5 − 5 and 64.5 + 5 inches tall—that is, between 59.5 and 69.5 inches. This fact is exactly true for an exactly Normal distribution. It is approximately true for the heights of young women because the distribution of heights is approximately Normal.

The other 5% of young women have heights outside the range from 59.5 to 69.5 inches. Because the Normal distributions are symmetric, half of these women are on the tall side. So the tallest 2.5% of young women are taller than 69.5 inches.
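The arithmetic in this example can be sketched in a few lines. The function name `rule_intervals` is hypothetical; it simply returns the ranges that the 68–95–99.7 rule gives for a stated mean and standard deviation.

```python
def rule_intervals(mu, sigma):
    """Ranges covering about 68%, 95%, and 99.7% of a Normal distribution."""
    return {
        "68%": (mu - sigma, mu + sigma),
        "95%": (mu - 2 * sigma, mu + 2 * sigma),
        "99.7%": (mu - 3 * sigma, mu + 3 * sigma),
    }

heights = rule_intervals(64.5, 2.5)   # heights of young women, in inches
low, high = heights["95%"]            # 59.5 and 69.5 inches

# The remaining 5% is split evenly between the two tails by symmetry,
# so about 0.025 of young women are taller than `high`.
share_taller = (1 - 0.95) / 2
```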

A normal distribution curve of height in inches with percent of observations labeled. — Figure 1.25 The 68–95–99.7 rule applied to the heights of young women, Example 1.34.

Because we will mention Normal distributions often, a short notation is helpful. We abbreviate the Normal distribution with mean μ and standard deviation σ as N(μ, σ). For example, the distribution of young women’s heights is N(64.5, 2.5).

Check-in

1.29 Test scores. Many states assess the skills of their students in various grades. One program that is available for this purpose is the National Assessment of Educational Progress (NAEP).²⁸ One of the tests provided by the NAEP assesses the mathematics skills of eighth-grade students. In a recent year, the national mean score was 282, and the standard deviation was 40. Assuming that these scores are approximately Normally distributed, N ( 282 , 40 ) , use the 68–95–99.7 rule to give a range of scores that includes 95% of these students.
1.30 Use the 68–95–99.7 rule. Refer to the previous Check-in question. Use the 68–95–99.7 rule to give a range of scores that includes 99.7% of these students.

Standardizing observations

As the 68–95–99.7 rule suggests, all Normal distributions share many properties. In fact, all Normal distributions are the same if we measure in units of size σ about the mean μ as center. Changing to these units is called standardizing. To standardize a value, subtract the mean of the distribution and then divide by the standard deviation.

Standardizing and z-scores

If x is an observation from a distribution that has mean μ and standard deviation σ , the standardized value of x is

$$
z = \frac{x - \mu}{\sigma}
$$

A standardized value is often called a z-score.

A z-score tells us how many standard deviations the original observation falls away from the mean, and in which direction. Observations larger than the mean are positive when standardized, and observations smaller than the mean are negative.

To compare scores based on different measures, z -scores can be very useful. For example, see Exercise 1.85 (page 65), where you are asked to compare an SAT score with an ACT score.

Example 1.35 Find some z-scores.

The heights of young women are approximately Normal with μ = 64.5 inches and σ = 2.5 inches. The z -score for height is

$$
z = \frac{\text{height} - 64.5}{2.5}
$$

A woman’s standardized height is the number of standard deviations by which her height differs from the mean height of all young women. A woman 68 inches tall, for example, has z -score

$$
z = \frac{68 - 64.5}{2.5} = 1.4
$$

or a height that is 1.4 standard deviations above the mean. Similarly, a woman 5 feet (60 inches) tall has z -score

$$
z = \frac{60 - 64.5}{2.5} = -1.8
$$

or a height that is 1.8 standard deviations less than the mean.
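The same two calculations can be written as a one-line function. This is only a sketch; the name `z_score` is ours.

```python
def z_score(x, mu, sigma):
    """Standardized value: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma

z_tall = z_score(68, mu=64.5, sigma=2.5)    # 1.4 standard deviations above the mean
z_short = z_score(60, mu=64.5, sigma=2.5)   # 1.8 standard deviations below the mean
```

The sign of the result carries the direction: positive for heights above the mean, negative for heights below it.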

Check-in

1.31 Find the z-score. Consider the NAEP scores (see Check-in question 1.29, page 52), which we assume are approximately Normal, N ( 282 , 40 ) . Give the z -score for a student who received a score of 300.
1.32 Find another z-score. Consider the NAEP scores, which we assume are approximately Normal, N ( 282 , 40 ) . Give the z -score for a student who received a score of 200. Explain why your answer is negative even though all the test scores are positive.

We need a way to write variables, such as “height” in Example 1.34, that follow a theoretical distribution such as a Normal distribution. We use capital letters near the end of the alphabet for such variables. If X is the height of a young woman, we can then shorten “the height of a young woman is less than 68 inches” to “ X < 68 .” We will use lowercase x to stand for any specific value of the variable X .

We often standardize observations from symmetric distributions to express them in a common scale. We might, for example, compare the heights of two children of different ages by calculating their z -scores. The standardized heights tell us where each child stands in the distribution for his or her age group.

Standardizing is a linear transformation that transforms the data into the standard scale of z -scores. We know that a linear transformation does not change the shape of a distribution and that the mean and standard deviation change in a simple manner. In particular, the standardized values for any distribution always have mean 0 and standard deviation 1.
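A quick check of this fact on data: the sketch below standardizes a small set of made-up heights (hypothetical values, not from any data set in the text) using their own mean and standard deviation, and then recomputes the mean and standard deviation of the standardized values.

```python
from statistics import mean, stdev

# Hypothetical heights in inches, invented for this illustration.
heights = [61.0, 63.5, 64.0, 66.5, 70.0]

center = mean(heights)
spread = stdev(heights)  # sample standard deviation s

# Standardize each observation: subtract the mean, divide by the standard deviation.
z_values = [(x - center) / spread for x in heights]

z_mean = mean(z_values)    # 0, up to rounding error
z_spread = stdev(z_values) # 1, up to rounding error
```

The order and relative spacing of the observations are unchanged; only the scale is new.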

If the variable we standardize has a Normal distribution, standardizing does more than give a common scale. It makes all Normal distributions into a single distribution, and this distribution is still Normal. Standardizing a variable that has any Normal distribution produces a new variable that has the standard Normal distribution.

The standard Normal distribution

The standard Normal distribution is the Normal distribution N ( 0 , 1 ) with mean 0 and standard deviation 1.

If a variable X has any Normal distribution N ( μ , σ ) with mean μ and standard deviation σ , then the standardized variable

$$
Z = \frac{X - \mu}{\sigma}
$$

has the standard Normal distribution.
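One consequence can be checked numerically: the proportion of a N(μ, σ) distribution at or below x equals the proportion of the standard Normal distribution at or below the z-score of x. The sketch below assumes Python's `statistics.NormalDist` and uses the heights distribution N(64.5, 2.5) with x = 68 as the test case.

```python
from statistics import NormalDist

heights = NormalDist(mu=64.5, sigma=2.5)  # heights of young women
standard = NormalDist()                   # N(0, 1)

x = 68
z = (x - heights.mean) / heights.stdev    # standardized value of x

p_original = heights.cdf(x)   # proportion of heights at or below 68 inches
p_standard = standard.cdf(z)  # proportion of Z at or below the z-score
```

The two proportions agree, which is why a single table of standard Normal areas is enough for every Normal distribution.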

Normal distribution calculations

Areas under a Normal curve represent proportions of observations from that Normal distribution. There is no simple formula for areas under a Normal curve, so calculations use either software that calculates areas or a table of areas. The table and most software calculate one kind of area: cumulative proportion, which is the proportion of observations in a distribution that lie at or below a given value. When the distribution is given by a density curve, the cumulative proportion is the area under the curve to the left of a given value. Figure 1.26 shows the idea more clearly than words do.

A normal distribution curve. Value x is marked to the left of the mean. The area under the curve to the left of x is shaded. The cumulative proportion at x = the area under curve to the left of x. — Figure 1.26 The cumulative proportion for a value x is the proportion of all observations from the distribution that are less than or equal to x . This is the area to the left of x under the Normal curve.

The key to calculating Normal proportions is to match the area you want with areas that represent cumulative proportions. Then get areas for cumulative proportions either from software or (with an extra step) from a table. The following examples show the method in pictures.

Example 1.36 The NCAA standard for SAT scores.

The National Collegiate Athletic Association (NCAA) requires Division I athletes to get a combined score of at least 820 on the SAT Mathematics and Verbal tests to compete in their first college year.²⁹ (Higher scores are required for students with poor high school grades.) The scores of the 1.4 million students who took the SATs were approximately Normal with mean 1026 and standard deviation 209. What proportion of all students had SAT scores of at least 820?

Here is the calculation in pictures: the proportion of scores above 820 is the area under the curve to the right of 820. That’s the total area under the curve (which is always 1) minus the cumulative proportion up to 820. Note that we have used software for these calculations.

Three normal distribution curves illustrate a calculation.

area right of 820 = total area − area left of 820
0.8378 = 1 − 0.1622

Thus, the proportion of all SAT test-takers who would be NCAA qualifiers is 0.8378, or about 84%.

There is no area under a smooth curve that is exactly over the point 820. Consequently, the area to the right of 820 (the proportion of scores > 820 ) is the same as the area at or to the right of this point (the proportion of scores ≥ 820 ) . The actual data may contain a student who scored exactly 820 on the SAT. That the proportion of scores exactly equal to 820 is 0 for a Normal distribution is a consequence of the idealized smoothing of Normal distributions for data.

Example 1.37 Partial qualifiers.

The NCAA considers a student to be a “partial qualifier”—eligible to practice and receive an athletic scholarship, but not compete—if the combined SAT score is at least 720.³⁰ What proportion of all students who take the SAT would be partial qualifiers? That is, what proportion have scores between 720 and 820? Here are the pictures:

area between 720 and 820 = area left of 820 − area left of 720
0.0906 = 0.1622 − 0.0716

About 9% of all students who take the SAT have scores between 720 and 820.

How do we find the numerical values of the areas in Examples 1.36 and 1.37? If you use software, just plug in mean 1026 and standard deviation 209. Then ask for the cumulative proportions for 820 and for 720. (Your software will probably refer to these as “cumulative probabilities.” We will learn in Chapter 4 why the language of probability fits.) Sketches of the areas that you want similar to the ones in Examples 1.36 and 1.37 are very helpful in making sure that you are doing the correct calculations.
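As one possible version of "plugging in" the mean and standard deviation, the sketch below uses Python's standard-library `statistics.NormalDist`, whose `cdf` method returns cumulative proportions. Any statistical software would do; the variable names are ours.

```python
from statistics import NormalDist

sat = NormalDist(mu=1026, sigma=209)  # combined SAT scores, Examples 1.36 and 1.37

left_of_820 = sat.cdf(820)  # cumulative proportion up to 820
left_of_720 = sat.cdf(720)  # cumulative proportion up to 720

# Example 1.36: area to the right of 820 = total area minus area to the left.
at_least_820 = 1 - left_of_820

# Example 1.37: area between 720 and 820 = difference of two cumulative proportions.
between_720_and_820 = left_of_820 - left_of_720
```

Rounded to four places, these results match the values 0.8378 and 0.0906 quoted in the examples.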

Applet: You can use the Normal Curve applet on the text website to find Normal proportions. The applet is more flexible than most software: it will find any Normal proportion, not just cumulative proportions. The applet is an excellent way to understand Normal curves. But, because of the limitations of web browsers, the applet is not as accurate as statistical software.

If you are not using software, you can find cumulative proportions for Normal curves from a table. That requires an extra step, as we now explain.

Using the standard Normal table

The extra step in finding cumulative proportions from a table is that we must first standardize to express the problem in the standard scale of z -scores. This allows us to get by with just one table, a table of standard Normal cumulative proportions. Table A in the back of the book gives standard Normal probabilities. The picture at the top of the table reminds us that the entries are cumulative proportions, areas under the curve to the left of a value z .

Example 1.38 Find the proportion from z.

What proportion of observations on a standard Normal variable Z take values less than 1.47? To find the area to the left of 1.47, locate 1.4 in the left-hand column of Table A and then locate the remaining digit 7 as .07 in the top row. The entry opposite 1.4 and under .07 is .9292. This is the cumulative proportion we seek. Figure 1.27 illustrates this area.

A normal distribution curve. Value z = 1.47 is marked to the right of the mean. The area under the curve to the left of the z value is shaded, representing the table entry, area = 0.9292. — Figure 1.27 The area under a standard Normal curve to the left of the point z = 1.47 is 0.9292, Example 1.38.
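A table entry of this kind can be imitated in software. The sketch below assumes that entries are cumulative proportions for z given to two decimal places, rounded to four places, as in this example; the function name `table_entry` is hypothetical.

```python
from statistics import NormalDist

def table_entry(z):
    """Imitate a standard Normal table lookup: area to the left of z.

    z is rounded to two decimal places (row plus column) and the
    area is rounded to four places.
    """
    return round(NormalDist().cdf(round(z, 2)), 4)

entry = table_entry(1.47)  # 0.9292, the value found in Example 1.38
```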

Now that you see how Table A works, let’s redo the NCAA Examples 1.36 and 1.37 using the table.

Example 1.39 Find the proportion from x.

What proportion of college-bound students who take the SAT have scores of at least 820? The picture that leads to the answer is exactly the same as in Example 1.36. The extra step is that we first standardize to read cumulative proportions from Table A. If X is the SAT score, we want the proportion of students for which X ≥ 820.

Standardize. Subtract the mean, then divide by the standard deviation, to transform the problem about x into a problem about a standard Normal z :

$$
\begin{aligned}
X &\ge 820 \\
\frac{X - 1026}{209} &\ge \frac{820 - 1026}{209} \\
Z &\ge -0.99
\end{aligned}
$$
Use the table. Look at the pictures in Example 1.36. From Table A, we see that the proportion of observations less than − 0.99 is 0.1611. The area to the right of − 0.99 is therefore 1 − 0.1611 = 0.8389 . This is about 84%.

The area from the table in Example 1.39 (0.8389) is slightly less accurate than the area from software in Example 1.36 (0.8378) because we must round z to two places when we use Table A. The difference is rarely important in practice.

Example 1.40 Eligibility for aid and practice.

What proportion of all students who take the SAT would be eligible to receive athletic scholarships and to practice with the team but would not be eligible to compete in the eyes of the NCAA? That is, what proportion of students have SAT scores between 720 and 820? First, sketch the areas, exactly as in Example 1.37. We again use X as shorthand for an SAT score.

Standardize.

$$
\begin{aligned}
720 &\le X < 820 \\
\frac{720 - 1026}{209} &\le \frac{X - 1026}{209} < \frac{820 - 1026}{209} \\
-1.46 &\le Z < -0.99
\end{aligned}
$$
Use the table.

area between −1.46 and −0.99 = (area left of −0.99) − (area left of −1.46)
= 0.1611 − 0.0721 = 0.0890

As in Example 1.37, about 9% of students would be eligible to receive athletic scholarships and to practice with the team.
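The table method of Examples 1.39 and 1.40 can be traced step by step in code. This sketch imitates the table by rounding z to two places and areas to four; that rounding is the only reason its answers differ slightly from the software answers in Examples 1.36 and 1.37. The function name `table_area_left` is ours.

```python
from statistics import NormalDist

MU, SIGMA = 1026, 209  # combined SAT scores

def table_area_left(x):
    """Standardize x, round z to two places, and return (z, table-style area)."""
    z = round((x - MU) / SIGMA, 2)
    return z, round(NormalDist().cdf(z), 4)

z_820, left_820 = table_area_left(820)  # z = -0.99, area 0.1611
z_720, left_720 = table_area_left(720)  # z = -1.46, area 0.0721

at_least_820 = round(1 - left_820, 4)    # Example 1.39: 0.8389
between = round(left_820 - left_720, 4)  # Example 1.40: 0.0890
```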

Sometimes we encounter a value of z more extreme than those appearing in Table A. For example, the area to the left of z = − 4 is not given in the table. The z -values in Table A leave only area 0.0002 in each tail unaccounted for. For practical purposes, we can act as if there is zero area outside the range of Table A.
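Software can confirm how little area lies beyond the table. The sketch below assumes that the most extreme value in Table A is z = −3.49, an assumption that is consistent with the tail area of 0.0002 quoted above, and compares it with the area to the left of z = −4.

```python
from statistics import NormalDist

standard = NormalDist()

# Assumed edge of the table; the text itself does not state the table's range.
edge_of_table = standard.cdf(-3.49)

# Area to the left of a value beyond the table.
far_tail = standard.cdf(-4)
```

Both areas are tiny, and the area to the left of −4 is smaller still, so treating it as zero changes no practical conclusion.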

Check-in

1.33 Find the proportion. Consider the NAEP scores, which are approximately Normal, N ( 282 , 40 ) . Find the proportion of students who have scores less than 350. Find the proportion of students who have scores greater than or equal to 350. Sketch the relationship between these two calculations using pictures of Normal curves similar to the ones given in Example 1.36 (page 54).
1.34 Find another proportion. Consider the NAEP scores, which are approximately Normal, N ( 282 , 40 ) . Find the proportion of students who have scores between 300 and 350. Use pictures of Normal curves similar to the ones given in Example 1.37 (page 55) to illustrate your calculations.

Inverse Normal calculations

Examples 1.36 to 1.40 illustrate the use of Normal distributions to find the proportion of observations in a given event, such as “SAT score between 720 and 820.” We may instead want to find the observed value corresponding to a given proportion.

Statistical software will do this directly. Without software, use Table A backward, finding the desired proportion in the body of the table and then reading the corresponding z from the left column and top row.

Example 1.41 How high for the top 10%?

Scores for college-bound students on the SAT Verbal test in recent years follow approximately the N ( 500 , 110 ) distribution.³¹ How high must a student score to place in the top 10% of all students taking the SAT?

Again, the key to the problem is to draw a picture. Figure 1.28 shows that we want the score x with an area of 0.10 above it. That’s the same as area below x equal to 0.90.

Statistical software has a function that will give you the x for any cumulative proportion you specify. The function often has a name such as “inverse cumulative probability.” Plug in mean 500, standard deviation 110, and cumulative proportion 0.9. The software tells you that x = 641 to place in the highest 10%.

A normal curve with both x and corresponding z values on the horizontal axis. — Figure 1.28 Locating the point on a Normal curve with area 0.10 to its right, Example 1.41.
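For the software route, one possibility is the `inv_cdf` method of Python's `statistics.NormalDist`, which plays the role of the "inverse cumulative probability" function. This is a sketch with variable names of our choosing.

```python
from statistics import NormalDist

sat_verbal = NormalDist(mu=500, sigma=110)  # SAT Verbal scores, Example 1.41

# Area 0.10 above x is the same as cumulative proportion 0.90 below x.
cutoff = sat_verbal.inv_cdf(0.90)

# Check: the proportion of scores above the cutoff should be 0.10.
share_above = 1 - sat_verbal.cdf(cutoff)
```

Rounded to the nearest point, the cutoff is 641, in agreement with the software result in the example.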

Without software, first find the standard score z with cumulative proportion 0.9, then “unstandardize” to find x . Here is the two-step process:

Use the table. Look in the body of Table A for the entry closest to 0.9. It is 0.8997. This is the entry corresponding to z = 1.28 . So z = 1.28 is the standardized value with area 0.9 to its left.
Unstandardize to transform the solution from z back to the original x scale. We know that the standardized value of the unknown x is z = 1.28 . So x itself satisfies

$$
\frac{x - 500}{110} = 1.28
$$

Solving this equation for x gives

x = 500 + ( 1.28 ) ( 110 ) = 640.8

This equation should make sense: it finds the x that lies 1.28 standard deviations above the mean on this particular Normal curve. That is the “unstandardized” meaning of z = 1.28 . The general rule for unstandardizing a z -score is

x = μ + z σ
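The two-step process can also be sketched in code. The table built here is a stand-in for Table A: it assumes z values from −3.49 to 3.49 in steps of 0.01 with areas rounded to four places. The function names are hypothetical.

```python
from statistics import NormalDist

# Stand-in for Table A (assumed range and rounding): z -> area to the left of z.
TABLE = {i / 100: round(NormalDist().cdf(i / 100), 4) for i in range(-349, 350)}

def z_for_area(area):
    """Use the table backward: the z whose entry is closest to the given area."""
    return min(TABLE, key=lambda z: abs(TABLE[z] - area))

def unstandardize(z, mu, sigma):
    """Convert a z-score back to the original scale: x = mu + z * sigma."""
    return mu + z * sigma

z = z_for_area(0.90)             # entry 0.8997 is closest, so z = 1.28
x = unstandardize(z, 500, 110)   # about 640.8
```

The small gap between 640.8 and the software answer of 641 comes from using the closest table entry rather than the exact cumulative proportion 0.9.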

Check-in

1.35 What score is needed to be in the top 20%? Consider the NAEP scores, which are approximately Normal, N ( 282 , 40 ) . How high a score is needed to be in the top 20% of students who take this exam?
1.36 Find the score that 75% of students will exceed. Consider the NAEP scores, which are approximately Normal, N ( 282 , 40 ) . Seventy-five percent of the students will score above x on this exam. Find x .

Normal quantile plots

The Normal distributions provide good descriptions of some distributions of real data, such as the Iowa Test vocabulary scores. The distributions of some other common variables are usually skewed and therefore distinctly non-Normal. Examples include economic variables such as personal income and gross sales of business firms, the survival times of cancer patients after treatment, and the service lifetime of mechanical or electronic components. While experience can suggest whether or not a Normal distribution is plausible in a particular case, it is risky to assume that a distribution is Normal without actually inspecting the data.

A histogram or stemplot can reveal distinctly non-Normal features of a distribution, such as outliers, pronounced skewness, or gaps and clusters. If the stemplot or histogram appears roughly symmetric and unimodal, however, we need a more sensitive way to judge the adequacy of a Normal model. The most useful tool for assessing Normality is another graph, the Normal quantile plot.

Here is the basic idea of a Normal quantile plot. The graphs produced by software use more sophisticated versions of this idea. It is not practical to make Normal quantile plots by hand.

Arrange the observed data values from smallest to largest. Record what percentile of the data each value occupies. For example, the smallest observation in a set of 20 is at the 5% point, the second smallest is at the 10% point, and so on.
Do Normal distribution calculations to find the values of z corresponding to these same percentiles. For example, z = − 1.645 is the 5% point of the standard Normal distribution, and z = − 1.282 is the 10% point. We call these values of Z Normal scores.
Plot each data point x against the corresponding Normal score. If the data distribution is close to any Normal distribution, the plotted points will lie close to a straight line.

Any Normal distribution produces a straight line on the plot because standardizing turns any Normal distribution into a standard Normal distribution. Standardizing is a linear transformation that can change the slope and intercept of the line in our plot but cannot turn a line into a curved pattern.
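The three steps above can be sketched in a few lines of Python. This is only an illustration of the basic idea, not the method used by any particular statistical package. The function names and the data are hypothetical. One assumption differs from the simple description above: the i-th smallest of n observations is assigned the percentile (i − 0.5)/n rather than i/n, so that the largest observation does not land at the 100% point, where no finite Normal score exists.

```python
from statistics import NormalDist


def normal_scores(n):
    """Return the Normal scores (z-values) for n ordered observations.

    Assumption (not from the text): the i-th smallest observation is
    placed at percentile (i - 0.5) / n, one common plotting position.
    """
    std_normal = NormalDist()  # standard Normal: mean 0, standard deviation 1
    return [std_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]


def quantile_plot_points(data):
    """Pair each sorted data value x with its Normal score z."""
    xs = sorted(data)                 # Step 1: arrange from smallest to largest
    zs = normal_scores(len(xs))       # Step 2: z-values at the same percentiles
    return list(zip(zs, xs))          # Step 3: the (z, x) points to plot


# Hypothetical data, for illustration only.
sample = [12.1, 9.8, 10.5, 11.3, 8.9, 10.0, 10.9, 9.4]
for z, x in quantile_plot_points(sample):
    print(f"z = {z:6.3f}   x = {x}")

# The two Normal scores quoted in the text:
print(round(NormalDist().inv_cdf(0.05), 3))  # -1.645
print(round(NormalDist().inv_cdf(0.10), 3))  # -1.282
```

Plotting the x values against the z values with any graphing tool gives the Normal quantile plot. Points that fall close to a straight line suggest that a Normal model is reasonable.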

Use of Normal quantile plots

If the points on a Normal quantile plot lie close to a straight line, the plot indicates that the data are Normal.
Systematic deviations from a straight line indicate a non-Normal distribution.
Outliers appear as points that are far away from the overall pattern of the plot.
An optional line can be drawn on the plot that corresponds to the Normal distribution with mean equal to the mean of the data and standard deviation equal to the standard deviation of the data.

Figures 1.29 and 1.30 are Normal quantile plots for data we have met earlier. The data x are plotted vertically against the corresponding standard Normal z -score plotted horizontally. The z -score scale generally extends from − 3 to 3 because almost all of a standard Normal curve lies between these values. These figures show how Normal quantile plots behave.

Example 1.42 IQ scores are approximately Normal.

Data set icon for IQ.

Figure 1.29 is a Normal quantile plot of the 60 fifth-grade IQ scores from Table 1.1 (page 15). The points lie very close to the straight line drawn on the plot. We conclude that the distribution of IQ data is approximately Normal.

A normal quantile plot of IQ versus normal scores. — Figure 1.29 Normal quantile plot of IQ scores, Example 1.42. This distribution is approximately Normal.

Example 1.43 Times to start a business are skewed.

Data set icon for tts.

Figure 1.30 is a Normal quantile plot of the data on times to start a business from Example 1.19. The line drawn on the plot shows clearly that the plot of the data is curved. We conclude that these data are not Normally distributed. The shape of the curve is what we typically see with a distribution that is strongly skewed to the right.

A normal quantile plot of time in days versus normal scores. — Figure 1.30 Normal quantile plot for the length of time required to start a business, Example 1.43. This distribution is highly skewed.

Real data often show some departure from the theoretical Normal model. When you examine a Normal quantile plot, look for shapes that show clear departures from Normality. Don’t overreact to minor wiggles in the plot. When we discuss statistical methods that are based on the Normal model, we are interested in whether or not the data are sufficiently Normal for these procedures to work properly. We are not concerned about minor deviations from Normality. Many common methods work well as long as the data are approximately Normal and outliers are not present.

Beyond the Basics

Density estimation

A density curve gives a compact summary of the overall shape of a distribution. Many distributions do not have the Normal shape. There are other families of density curves that are used as mathematical models for various distribution shapes. Modern software offers more flexible options. A density estimator does not start with any specific shape, such as the Normal shape. It looks at the data and draws a density curve that describes the overall shape of the data. Density estimators join stemplots and histograms as useful graphical tools for exploratory data analysis.
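To make the idea concrete, here is a minimal sketch of one common kind of density estimator, a kernel density estimate. The text does not say which estimator the software uses, so everything here is an assumption made for illustration: the Gaussian kernel, the rule-of-thumb bandwidth 1.06 · s · n^(−1/5), the function name kernel_density, and the made-up data. The estimate places a small Normal-shaped bump at each observation and averages the bumps, so the overall shape comes from the data rather than from a preselected family of curves.

```python
from statistics import NormalDist, stdev


def kernel_density(data, bandwidth=None):
    """Return a function that estimates the density at any point x.

    Assumptions (not from the text): Gaussian kernel, and a
    normal-reference rule-of-thumb bandwidth when none is supplied.
    """
    n = len(data)
    if bandwidth is None:
        bandwidth = 1.06 * stdev(data) * n ** (-1 / 5)
    kernel = NormalDist(0, bandwidth)  # one bump, centered at 0

    def density(x):
        # Average the bumps centered at each observation.
        return sum(kernel.pdf(x - xi) for xi in data) / n

    return density


# Hypothetical right-skewed data, for illustration only.
sample = [3, 5, 6, 7, 7, 9, 10, 14, 18, 25, 41, 60]
density = kernel_density(sample)

# Like any density curve, the estimate should have total area close to 1.
step = 0.5
grid = [-60 + step * k for k in range(401)]  # covers -60 to 140
area = sum(density(x) for x in grid) * step
print(f"approximate area under the estimate: {area:.3f}")
```

A smaller bandwidth follows the data more closely and gives a bumpier curve. A larger bandwidth gives a smoother curve that may hide real features such as skewness.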

Example 1.44 Density estimation for IQ scores.

Data set icon for IQ.

In Example 1.42 we observed that the points in the Normal quantile plot for the IQ data were very close to a straight line. This suggests that a Normal distribution is a good fit for these data. Figure 1.31 provides another way to look at this issue. Here we see the histogram with a density estimate, the red curve, along with the best-fitting Normal density curve, the green curve. Because the two curves are approximately the same, we are confident in any further analysis of these data based on the assumption that the data are approximately Normal.

A screen capture of a JMP boxplot and histogram with normal and smooth density curves. — Figure 1.31 Histogram of IQ scores, with a density estimate and a Normal curve, Example 1.44. The IQ scores are approximately Normal.

The top graphical summary is a boxplot with values as follows. Minimum, 80. Q 1, 104. Median, 114. Q 3, 125. Maximum, 145. The bottom summary is a histogram that plots count on the vertical axis, ranging from 0 to 20 in increments of 2.5 versus IQ score on the horizontal axis, ranging from 80 to 150 with a class size of 10. The distribution is roughly normal. By IQ score class, the counts are as follows. 80 to less than 90, 3. 90 to less than 100, 5. 100 to less than 110, 14. 110 to less than 120, 17.5. 120 to less than 130, 11. 130 to less than 140, 9. 140 to less than 150, 2.5. Normal and smooth curves are drawn over the histogram. Both trace nearly the same normal distribution with slight variation. All values estimated.

Software Output

Here is another example where we see a different picture.

Example 1.45 Density estimation for times to start a business.

Data set icon for tts.

In Example 1.43, we examined the Normal quantile plot for the time to start a business data. Figure 1.32 shows the histogram for these data along with a density estimate and the best-fitting Normal distribution. The two density curves are very different, and we conclude that a Normal distribution does not give a good fit for these data. Not only are the data strongly skewed, but there is also a clear outlier. We should be very cautious about using a statistical analysis based on an assumption that the data are approximately Normal in this case.

A screen capture of a JMP histogram with normal and smooth density curves. — Figure 1.32 Histogram of the length of time required to start a business, with a density estimate and a Normal curve, Example 1.45. The Normal distribution is not a good fit for these data.

The histogram plots count on the vertical axis, ranging from 0 to 70 in increments of 10 versus time to start on the horizontal axis, ranging from negative 30 to 250 with a class size of 10. The distribution is right-skewed with an outlier. By time class, the counts are as follows. 0 to less than 10, 70. 10 to less than 20, 60. 20 to less than 30, 20. 30 to less than 40, 15. 40 to less than 50, 13. 50 to less than 60, 0. 60 to less than 70, 8. 70 to less than 80, 5. 80 to less than 90, 2. 90 to less than 100, 1. Outlier, 230 to less than 240, 1. Smooth and normal curves are drawn over the distribution. The smooth curve traces the right-skewed distribution. The normal curve is drawn over it, but doesn’t match. It is shorter than the histogram and doesn’t account for the outlier. It has a maximum count of 30 at score 20 and appears closer to normal.

Software Output

Section 1.4 SUMMARY

We can describe the overall pattern of a distribution by a density curve. A density curve has total area 1 underneath it. An area under a density curve gives the proportion of observations that fall in a range of values.
A density curve is an idealized description of the overall pattern of a distribution that smooths out the irregularities in the actual data. We write the mean of a density curve as μ and the standard deviation of a density curve as σ to distinguish them from the mean x̄ and the standard deviation s of the actual data.
The mean μ is the balance point of the curve. The median divides the area under the curve in half. The quartiles and the median divide the area under the curve into quarters. The standard deviation σ cannot be located by eye on most density curves.
The mean and median are equal for symmetric density curves, but the mean of a skewed curve is located farther toward the long tail than is the median.
The Normal distributions are described by a special family of bell-shaped, symmetric, unimodal density curves. The mean μ and standard deviation σ completely specify a Normal distribution N ( μ , σ ) . The mean is the center of the curve, and σ is the distance from μ to the change-of-curvature points on either side.
To standardize any observation x , subtract the mean of the distribution and then divide by the standard deviation. The resulting z -score z = ( x − μ ) / σ says how many standard deviations x lies from the distribution mean.
All Normal distributions are the same when measurements are transformed to the standardized scale. In particular, all Normal distributions satisfy the 68–95–99.7 rule, which describes what percent of observations lie within one, two, and three standard deviations of the mean.
If X has the N ( μ , σ ) distribution, then the standardized variable Z = ( X − μ ) / σ has the standard Normal distribution N ( 0 , 1 ) . Proportions for any Normal distribution can be calculated by software or from the standard Normal table (Table A), which gives the cumulative proportions of Z < z for many values of z .
The adequacy of a Normal model for describing a distribution of data is best assessed by a Normal quantile plot, which is available in most statistical software packages. A pattern on such a plot that deviates substantially from a straight line indicates that the data are not Normal.
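The calculations in this summary can be checked with software. The following is a minimal sketch using Python's standard library. The mean 50 and standard deviation 10 are hypothetical values chosen only for illustration, and the helper names z_score and proportion_within are likewise made up for this sketch.

```python
from statistics import NormalDist

# Hypothetical Normal distribution N(50, 10), for illustration only.
mu, sigma = 50.0, 10.0
dist = NormalDist(mu, sigma)
std_normal = NormalDist()  # N(0, 1)


def z_score(x, mu, sigma):
    """Standardize x: how many standard deviations x lies from the mean."""
    return (x - mu) / sigma


def proportion_within(k):
    """Area under the N(mu, sigma) curve within k standard deviations of mu."""
    return dist.cdf(mu + k * sigma) - dist.cdf(mu - k * sigma)


x = 65.0
z = z_score(x, mu, sigma)
print(z)  # 1.5

# The cumulative proportion is the same on either scale.
print(round(dist.cdf(x), 4))        # 0.9332
print(round(std_normal.cdf(z), 4))  # 0.9332

# The 68-95-99.7 rule is an approximation to these areas.
for k in (1, 2, 3):
    print(k, round(proportion_within(k), 4))  # 0.6827, 0.9545, 0.9973
```

Because standardizing reduces every Normal distribution to N ( 0 , 1 ) , the three areas printed at the end do not depend on the particular μ and σ chosen.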

Now that you have completed this section, you will be able to:

Sketch a Normal distribution for any given mean and standard deviation. Review Example 1.34 (page 51) and try Exercise 1.65.
Apply the 68–95–99.7 rule to find the proportions of observations within one, two, and three standard deviations of the mean for any Normal distribution. Review Example 1.34 (page 51) and try Exercise 1.65.
Find the z -score for any observation x . Review Example 1.35 (page 53) and try Exercise 1.67.
Compute areas under a Normal curve using software or Table A. Review Example 1.36 (page 54) and try Exercise 1.79.
Perform inverse Normal calculations to find values of a Normal variable corresponding to any given area. Review Example 1.41 (page 58) and try Exercise 1.81.
Assess the extent to which the distribution of a set of data can be approximated by a Normal distribution. Review Examples 1.42 (page 60) and 1.43 (page 60) and try Exercise 1.105.

Section 1.4 EXERCISES

1.61 What’s wrong? Explain what is wrong with each of the following:
1. Standardized values are always positive.
2. Ninety-five percent of the values of a Normal distribution will be within one standard deviation of the mean.
3. The standard Normal distribution has mean equal to 1 and standard deviation equal to 0.
1.62 Means and medians.
1. Sketch a symmetric distribution that is not Normal. Mark the location of the mean and the median.
2. Sketch a distribution that is skewed to the right. Mark the location of the mean and the median.
1.63 The effect of changing the standard deviation.
1. Sketch a Normal curve that has mean 20 and standard deviation 2.
2. On the same x axis, sketch a Normal curve that has mean 20 and standard deviation 4.
3. How does the Normal curve change when the standard deviation is varied but the mean stays the same?
1.64 The effect of changing the mean.
1. Sketch a Normal curve that has mean 20 and standard deviation 2.
2. On the same x axis, sketch a Normal curve that has mean 30 and standard deviation 2.
3. How does the Normal curve change when the mean is varied but the standard deviation stays the same?
1.65 NAEP eighth-grade geography scores. In Check-in question 1.29 (page 52) we examined the distribution of NAEP scores for the eighth-grade mathematics skills assessment. For eighth-grade students, the distribution of geography scores is approximately Normal, with mean 261 and standard deviation 31.
1. Sketch this Normal distribution.
2. Make a table that includes values of the scores corresponding to plus or minus one, two, and three standard deviations from the mean. Mark these points on your sketch along with the mean.
3. Apply the 68–95–99.7 rule to this distribution. Give the ranges of geography score values that are within one, two, and three standard deviations of the mean.
1.66 NAEP 12th-grade geography scores. Refer to the previous exercise. The scores for 12th-grade students on the geography assessment are approximately N ( 282 , 26 ) . Answer the questions in the previous exercise for this assessment.
1.67 Standardize some NAEP eighth-grade geography scores. The NAEP geography assessment scores for eighth-grade students are approximately N ( 261 , 31 ) . Find z -scores by standardizing the following scores: 200, 250, 280, 300, 320.
1.68 Compute the percentile scores. Refer to the previous exercise. When scores such as the NAEP assessment scores are reported for individual students, the actual values of the scores are not particularly meaningful. Usually, they are transformed into percentile scores. The percentile score is the proportion of students who would score less than or equal to the score for the individual student. Compute the percentile scores for the five scores in the previous exercise. State whether you used software or Table A for these computations.

NAEP 1.69 Are the NAEP eighth-grade geography scores approximately Normal? In Exercise 1.65, we assumed that the NAEP U.S. geography scores for eighth-grade students are approximately Normal with the reported mean and standard deviation, N ( 261 , 31 ) . Let’s check that assumption. In addition to means and standard deviations, you can find selected percentiles for the NAEP assessments (see previous exercise). For the 8th-grade geography scores, the following percentiles are reported:

Percentile	Score
10%	220
25%	242
50%	263
75%	283
90%	300

Use these percentiles to assess whether or not the NAEP geography scores for 8th-grade students are approximately Normal. Write a short report describing your methods and conclusions.

NAEP 1.70 Are the NAEP eighth-grade mathematics scores approximately Normal? Refer to the previous exercise. For the NAEP eighth-grade mathematics scores, the mean is 282, and the standard deviation is 40. Here are the reported percentiles:

Percentile	Score
10%	231
25%	255
50%	282
75%	309
90%	333

Is the N ( 282 , 40 ) distribution a good approximation for the NAEP mathematics scores? Write a short report describing your methods and conclusions.

1.71 Do women talk more? Conventional wisdom suggests that women are more talkative than men. One study designed to examine this stereotype collected data on the speech of 42 women and 37 men in the United States.³²
1. The mean number of words spoken per day by the women was 14,297, with a standard deviation of 6441. Use the 68–95–99.7 rule to describe this distribution.
2. Do you think that applying the rule in this situation is reasonable? Explain your answer.
3. The men averaged 14,060 words per day, with a standard deviation of 9065. Answer the questions in parts 1 and 2 for the men.
4. Do you think that the data support the conventional wisdom? Explain your answer. Note that in Section 7.2 we will learn formal statistical methods to answer this type of question.
1.72 Data from Mexico. Refer to the previous exercise. A similar study in Mexico was conducted with 31 women and 20 men. The women averaged 14,704 words per day, with a standard deviation of 6215. For men the mean was 15,022, and the standard deviation was 7864.
1. Answer the questions from the previous exercise for the Mexican study.
2. The means for both men and women are higher for the Mexican study than for the U.S. study. What conclusions can you draw from this observation?
1.73 A uniform distribution. If you ask a computer to generate “random numbers” between 0 and 1, you will get observations from a uniform distribution. Figure 1.33 graphs the density curve for a uniform distribution. Use areas under this density curve to answer the following questions.
1. What proportion of the observations lie below 0.75?
2. What proportion of the observations lie below 0.50?
3. What proportion of the observations lie between 0.50 and 0.75?
4. Why is the total area under this curve equal to 1?
Figure 1.33 The density curve of a uniform distribution, Exercise 1.73.
1.74 Use a different range for the uniform distribution. Many random number generators allow users to specify the range of the random numbers to be produced. Suppose that you specify that the outcomes are to be distributed uniformly between 0 and 5. Then the density curve of the outcomes has constant height between 0 and 5 and height 0 elsewhere.
1. What is the height of the density curve between 0 and 5? Draw a graph of the density curve.
2. Use your graph from part 1 and the fact that areas under the curve are proportions of outcomes to find the proportion of outcomes that are more than 2.
3. Find the proportion of outcomes that lie between 2.5 and 3.0.
1.75 Find the mean, the median, and the quartiles. What are the mean and the median of the uniform distribution in Figure 1.33? What are the quartiles?
1.76 Three density curves. Figure 1.34 displays three density curves, each with three points marked on it. At which of these points on each curve do the mean and the median fall?

Figure 1.34 Three density curves, Exercise 1.76.

Three points are marked on each curve, A, B, and C. The first curve is right skewed. Point A is marked at the vertex. Points B and C are marked just right of A, before the tail, with C closer to B than B is to A. The second curve is a normal distribution. Point A is marked at the mean with B and C marked at regular intervals to the right of it. The third curve is left skewed. Point C is marked at the vertex. Points B and A are marked to the left of C, with A closer to B than B is to C.
1.77 Use the Normal Curve applet. Use the Normal Curve applet for the standard Normal distribution to say how many standard deviations above and below the mean the quartiles of any Normal distribution lie.
1.78 Use the Normal Curve applet. The 68–95–99.7 rule for Normal distributions is a useful approximation. You can use the Normal Curve applet on the text website to see how accurate the rule is. Drag one flag across the other so that the applet shows the area under the curve between the two flags.
1. Place the flags one standard deviation on either side of the mean. What is the area between these two values? What does the 68–95–99.7 rule say this area is?
2. Repeat for locations two and three standard deviations on either side of the mean. Again compare the 68–95–99.7 rule with the area given by the applet.
1.79 Find some proportions. Using either software or Table A, find the proportion of observations from a standard Normal distribution that satisfies each of the following statements. In each case, sketch a standard Normal curve and shade the area under the curve that is the answer to the question.
1. Z > 1.85
2. Z < 1.85
3. Z > − 0.90
4. − 0.90 < Z < 1.85
1.80 Find more proportions. Using either software or Table A, find the proportion of observations from a standard Normal distribution for each of the following events. In each case, sketch a standard Normal curve and shade the area representing the proportion.
1. Z ≤ − 1.7
2. Z ≥ − 1.7
3. Z > 2.1
4. − 1.7 < Z < 2.1
1.81 Find some values of z. Find the value z of a standard Normal variable Z that satisfies each of the following conditions. (If you use Table A, report the value of z that comes closest to satisfying the condition.) In each case, sketch a standard Normal curve with your value of z marked on the axis.
1. 68% of the observations fall below z
2. 75% of the observations fall above z
1.82 Find more values of z. The variable Z has a standard Normal distribution. Illustrate each of the following results with a sketch.
1. Find the number z that has cumulative proportion 0.68.
2. Find the number z such that the event Z > z has proportion 0.122.
1.83 Find some values of z. The Wechsler Adult Intelligence Scale (WAIS) is the most common IQ test. The scale of scores is set separately for each age group, and the scores are approximately Normal, with mean 100 and standard deviation 15. People with WAIS scores below 70 are considered developmentally disabled when, for example, applying for Social Security disability benefits. What percent of adults are developmentally disabled by this criterion?
1.84 High IQ scores. Refer to the previous exercise. The organization MENSA, which calls itself “the high-IQ society,” requires a WAIS score of 130 or higher for membership. What percent of adults would qualify for membership?

There are two major tests of readiness for college, the ACT and the SAT. ACT scores are reported on a scale from 1 to 36. The distribution of ACT scores is approximately Normal, with mean μ = 21.5 and standard deviation σ = 5.4 . SAT scores are reported on a scale from 400 to 1600. The distribution of SAT scores is approximately Normal, with mean μ = 1026 and standard deviation σ = 209 . Exercises 1.85 through 1.94 are based on this information.
1.85 Compare an SAT score with an ACT score. Jessica scores 1240 on the SAT. Ashley scores 28 on the ACT. Assuming that both tests measure the same thing, who has the higher score? Report the z -scores for both students.
1.86 Make another comparison. Joshua scores 14 on the ACT. Anthony scores 690 on the SAT. Assuming that both tests measure the same thing, who has the higher score? Report the z -scores for both students.
1.87 Find the ACT equivalent. Jorge scores 1400 on the SAT. Assuming that both tests measure the same thing, what score on the ACT is equivalent to Jorge’s SAT score?
1.88 Find the SAT equivalent. Alyssa scores 32 on the ACT. Assuming that both tests measure the same thing, what score on the SAT is equivalent to Alyssa’s ACT score?
1.89 Find an SAT percentile. Reports on a student’s ACT or SAT results usually give the percentile as well as the actual score. The percentile is just the cumulative proportion stated as a percent: the percent of all scores that were lower than or equal to this one. Renee scores 1360 on the SAT. What is her percentile?
1.90 Find an ACT percentile. Reports on a student’s ACT or SAT results usually give the percentile as well as the actual score. The percentile is just the cumulative proportion stated as a percent: the percent of all scores that were lower than or equal to this one. Joshua scores 21 on the ACT. What is his percentile?
1.91 How high is the top 15%? What SAT scores make up the top 15% of all scores?
1.92 How low is the bottom 15%? What SAT scores make up the bottom 15% of all scores?
1.93 Find the ACT quintiles. The quintiles of any distribution are the values with cumulative proportions 0.20, 0.40, 0.60, and 0.80. What are the quintiles of the distribution of ACT scores?
1.94 Find the SAT quartiles. The quartiles of any distribution are the values with cumulative proportions 0.25 and 0.75. What are the quartiles of the distribution of SAT scores?
1.95 Do you have enough “good cholesterol”? High-density lipoprotein (HDL) is sometimes called the “good cholesterol” because high values are associated with a reduced risk of heart disease. According to the American Heart Association, people over the age of 20 years should have at least 40 milligrams per deciliter (mg/dl) of HDL cholesterol.³³ U.S. women aged 20 and over have a mean HDL of 55 mg/dl with a standard deviation of 15.5 mg/dl. Assume that the distribution is Normal.
1. What percent of women have low values of HDL (40 mg/dl or less)?
2. HDL levels of 60 mg/dl and higher are believed to protect people from heart disease. What percent of women have protective levels of HDL?
3. Women with more than 40 mg/dl but less than 60 mg/dl of HDL are in the intermediate range, neither very good nor very bad. What proportion are in this category?
1.96 Men and HDL cholesterol. HDL cholesterol levels for men have a mean of 46 mg/dl, with a standard deviation of 13.6 mg/dl. Assume that the distribution is Normal. Answer the questions given in the previous exercise for the population of men.
1.97 Diagnosing osteoporosis. Osteoporosis is a condition in which the bones become brittle due to loss of minerals. To diagnose osteoporosis, an elaborate apparatus measures bone mineral density (BMD). BMD is usually reported in standardized form. The standardization is based on a population of healthy young adults. The World Health Organization (WHO) criterion for osteoporosis is a BMD 2.5 standard deviations below the mean for young adults. BMD measurements in a population of people similar in age and sex roughly follow a Normal distribution.
1. What percent of healthy young adults have osteoporosis by the WHO criterion?
2. Women aged 70 to 79 are of course not young adults. The mean BMD in this age group is about − 2 on the standard scale for young adults. Suppose that the standard deviation is the same as for young adults. What percent of this older population has osteoporosis?
1.98 Deciles of Normal distributions. The deciles of any distribution are the 10th, 20th, . . . , 90th percentiles. The first and last deciles are the 10th and 90th percentiles, respectively.
1. What are the first and last deciles of the standard Normal distribution?
2. The weights of 9-ounce potato chip bags are approximately Normal, with mean 9.11 ounces and standard deviation 0.14 ounce. What are the first and last deciles of this distribution?
1.99 Quartiles for Normal distributions. The quartiles of any distribution are the values with cumulative proportions 0.25 and 0.75.
1. What are the quartiles of the standard Normal distribution?
2. Using your numerical values from part 1, write an equation that gives the quartiles of the N ( μ , σ ) distribution in terms of μ and σ .
1.100 IQR for Normal distributions. Continue your work from the previous exercise. The interquartile range IQR is the distance between the first and third quartiles of a distribution.
1. What is the value of the IQR for the standard Normal distribution?
2. There is a constant c such that IQR = c σ for any Normal distribution N ( μ , σ ) . What is the value of c ?
1.101 Outliers for Normal distributions. Continue your work from the previous two exercises. The percent of the observations that are suspected outliers according to the 1.5 × IQR rule is the same for any Normal distribution. What is this percent?
1.102 Deciles of HDL cholesterol. The deciles of any distribution are the 10th, 20th, . . . , 90th percentiles. Refer to Exercise 1.95 where we assumed that the distribution of HDL cholesterol in U.S. women aged 20 and over is Normal with mean 55 mg/dl and standard deviation 15.5 mg/dl. Find the deciles for this distribution.
1.103 Longleaf pine trees. Exercise 1.56 (page 46) gives the diameter at breast height (DBH) for 40 longleaf pine trees from the Wade Tract in Thomas County, Georgia. Make a Normal quantile plot for these data and write a short paragraph interpreting what it describes.
1.104 Potassium from potatoes. Refer to Exercise 1.15 (page 22), where you used a stemplot to examine the potassium absorption of a group of 27 adults who ate a controlled diet that included 40 mEq of potassium from potatoes for five days. In Exercise 1.33 (page 43), you compared the stemplot, the histogram, and the boxplot as graphical summaries of this distribution.
1. Generate these three graphical summaries.
2. Make a Normal quantile plot and interpret it.
1.105 Potassium from a supplement. Refer to Exercise 1.16 (page 22), where you used a stemplot to examine the potassium absorption of a group of 29 adults who ate a controlled diet that included 40 mEq of potassium from a supplement for five days. In Exercise 1.34 (page 43), you compared the stemplot, the histogram, and the boxplot as graphical summaries of this distribution.
1. Generate these three graphical summaries.
2. Make a Normal quantile plot and interpret it.