A scatterplot displays the form, direction, and strength of the relationship between two quantitative variables. Linear (straight-line) relations are particularly important because a straight line is a simple pattern that is quite common.
We say a linear relationship is strong if the points lie close to a straight line and weak if they are widely scattered about a line. Our eyes, however, are not the best judges of how strong a relationship is. The two scatterplots in Figure 2.14 depict exactly the same data, but the lower plot is in an expanded field of possible values. The lower plot seems to show a stronger relationship.
Figure 2.14 Two scatterplots of the same data. The linear pattern in the lower plot appears stronger because of the surrounding open space.
Our eyes can be fooled by changing the plotting scales or the amount of white space around the cloud of points in a scatterplot.16 We need to follow our strategy for data analysis by using a numerical measure to supplement the graph. Correlation is the measure we use.
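We have data on variables x and y for n cases. Think, for example, of measuring height and weight of n people, including ourselves. We could then have $x_i$ and $y_i$ be the height and weight of the $i$th person. Writing $\bar{x}$ and $s_x$ for the mean and standard deviation of the x-values, and $\bar{y}$ and $s_y$ for those of the y-values, the correlation r between x and y is

$$r = \frac{1}{n-1}\sum_{i=1}^{n}\left(\frac{x_i-\bar{x}}{s_x}\right)\left(\frac{y_i-\bar{y}}{s_y}\right)$$

As always, the summation sign $\sum$ means "add these terms for all the cases."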
The formula for r begins by standardizing the observations.
Going back to our height and weight example, suppose that x is height in centimeters and y is weight in kilograms. Then

$$\frac{x_i - \bar{x}}{s_x}$$

is the standardized height of the ith person, where $\bar{x}$ and $s_x$ are the mean and standard deviation of the n heights. A standardized value, or z-score, says how many standard deviations above or below the mean a value lies. This means standardized values have no units: in this example, the standardized heights are no longer measured in centimeters. If we also standardize the weights, the two variables are on a comparable scale. The correlation r is the average of the products of the standardized height and the standardized weight for the n people.
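The calculation just described can be sketched in a few lines of Python. This is a minimal illustration, not the output of any particular statistics package: the heights and weights are made-up values, the names `standardize` and `correlation` are our own, and we assume the usual convention of dividing the sum of products by n − 1 (the same divisor used in the sample standard deviation).

```python
from statistics import mean, stdev


def standardize(values):
    """Return z-scores: (value - mean) / sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


def correlation(x, y):
    """Sum of the products of paired z-scores, divided by n - 1."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("x and y must have the same length, at least 2")
    zx, zy = standardize(x), standardize(y)
    return sum(a * b for a, b in zip(zx, zy)) / (len(x) - 1)


# Hypothetical heights (cm) and weights (kg) for five people.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 66, 70, 80]

r = correlation(heights_cm, weights_kg)
print(f"r = {r:.4f}")
```

Because each z-score is a deviation divided by a standard deviation in the same units, the centimeters and kilograms cancel before the products are formed, which is why the result carries no units.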
Figure 2.3 (page 80) gives the scatterplot of rating versus price per load for 52 laundry detergents. Software gives the value of the correlation as 0.21. A straight line is included to help us evaluate the form of the relationship. The relationship is positive and approximately linear but very weak. The value of the correlation is very low, supporting the graphical summary that shows a weak relationship.
Figure 2.1 (page 78) gives the scatterplot for the laundry detergent data with the outlier included. The major feature in the plot is the outlier, which has a somewhat average rating but a price that is about twice as much as the next most expensive detergent. The correlation measures the direction and strength of a linear relationship, so it is not a very useful numerical summary for this relationship.
Figure 2.6 (page 82) is a scatterplot of calcium retention versus calcium intake. The relationship is clearly curved where retention levels off at higher levels of intake. The correlation is not a good numerical summary for this relationship.
2.11 Laundry detergents. Example 2.9 (page 77) describes data on the rating and price per load for 53 laundry detergents. Use software to show that the correlation between rating and price is the same as the correlation between price and rating. Report the correlation with four digits after the decimal.
2.12 Change the units. Refer to the previous Check-in question. Express the price per load in dollars.
(a) Is the transformation from cents to dollars a linear transformation? Explain your answer.
(b) Compute the correlation between rating and price per load expressed in dollars.
(c) How does the correlation that you computed in part (b) compare with the one you computed in the previous Check-in question?
(d) What can you say in general about the effect of changing units using linear transformations on the size of the correlation?
The formula for correlation helps us see that r is positive when there is a positive association between the variables. Height and weight, for example, have a positive association. People who are above average in height tend to also be above average in weight. Both the standardized height and the standardized weight for such a person are positive. People who are below average in height tend also to have below-average weight. Then both standardized height and standardized weight are negative. In both cases, the products in the formula for r are mostly positive, so r is positive. In the same way, we can see that r is negative when the association between x and y is negative. More detailed study of the formula gives more detailed properties of r.
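To see the sign argument at work, the sketch below lists the individual products of standardized values for a small made-up data set. The numbers and the helper name `z_scores` are illustrative assumptions, and we again assume the sum of products is divided by n − 1.

```python
from statistics import mean, stdev


def z_scores(values):
    """Standardize: subtract the mean, divide by the sample standard deviation."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]


# Hypothetical heights (cm) and weights (kg) for six people.
heights_cm = [158, 163, 168, 171, 176, 182]
weights_kg = [54, 60, 67, 63, 74, 81]

# One product per person: positive when both values fall on the same side of their means.
products = [zx * zy for zx, zy in zip(z_scores(heights_cm), z_scores(weights_kg))]
positive_count = sum(p > 0 for p in products)
r = sum(products) / (len(products) - 1)

# Reversing the weights pairs tall people with low weights: a negative association.
reversed_weights = list(reversed(weights_kg))
neg_products = [zx * zy for zx, zy in zip(z_scores(heights_cm), z_scores(reversed_weights))]
r_negative = sum(neg_products) / (len(neg_products) - 1)

print(positive_count, "of", len(products), "products are positive; r =", round(r, 3))
print("with weights reversed, r =", round(r_negative, 3))
```

Not every product has to be positive for r to be positive. Two people near the middle of this sample contribute small negative products, and the large positive products from the people far from both means outweigh them.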
Here is what you need to know to interpret correlation:
Correlation makes no use of the distinction between explanatory and response variables. It makes no difference which variable you call x and which you call y in calculating the correlation.
Correlation requires that both variables be quantitative. For example, we cannot calculate a correlation between the incomes of a group of people and what city they live in because city is a categorical variable.
Because r uses the standardized values of the observations, r does not change when we change the units of measurement (a linear transformation) of x, y, or both. Measuring height in inches rather than centimeters and weight in pounds rather than kilograms does not change the correlation between height and weight. The correlation r itself has no unit of measurement; it is just a number.
Positive r indicates positive association between the variables, and negative r indicates negative association.
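The correlation r is always a number between −1 and 1. Values of r near 0 indicate a very weak linear relationship; values close to −1 or 1 indicate that the points lie close to a straight line.
Correlation measures the strength of only the linear relationship between two variables. Correlation does not describe curved relationships between variables, no matter how strong they are.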
Like the mean and standard deviation, the correlation is not resistant: r is strongly affected by a few outlying observations. Use r with caution when outliers appear in the scatterplot.
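Several of these properties can be checked numerically. The following sketch uses small made-up data sets and a hand-written `correlation` function (our own name, assuming the n − 1 divisor); it is meant as an illustration, not as a replacement for statistical software.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, sx = mean(x), stdev(x)
    my, sy = mean(y), stdev(y)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (len(x) - 1)


# Hypothetical heights (cm) and weights (kg) for five people.
heights_cm = [160, 165, 170, 175, 180]
weights_kg = [55, 62, 66, 70, 80]
r = correlation(heights_cm, weights_kg)

# Swapping the roles of x and y leaves r unchanged.
r_swapped = correlation(weights_kg, heights_cm)

# Changing units (cm to inches, kg to pounds) leaves r unchanged.
heights_in = [h / 2.54 for h in heights_cm]
weights_lb = [w * 2.20462 for w in weights_kg]
r_new_units = correlation(heights_in, weights_lb)

# A perfectly predictable but curved relationship can have r = 0.
x_sym = [-2, -1, 0, 1, 2]
y_curved = [v * v for v in x_sym]
r_curved = correlation(x_sym, y_curved)

# One outlying case (average height, very high weight) pulls r down sharply.
r_outlier = correlation(heights_cm + [170], weights_kg + [120])

print(f"r = {r:.3f}, swapped = {r_swapped:.3f}, new units = {r_new_units:.3f}")
print(f"curved = {r_curved:.3f}, with outlier = {r_outlier:.3f}")
```

The last two results are the ones to remember when reading a reported correlation: a value near 0 does not rule out a strong curved relationship, and a single unusual case can move r a long way.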
The scatterplots in Figure 2.15 illustrate how values of r closer to 1 or −1 correspond to stronger linear relationships.
Figure 2.15 How the correlation r measures the direction and strength of a linear association.
Finally, remember that correlation is not a complete description of two-variable data, even when the relationship between the variables is linear. You should give the means and standard deviations of both x and y along with the correlation. (Because the formula for correlation uses the means and standard deviations, these measures are the proper choices to accompany a correlation.) Conclusions based on correlations alone may require rethinking in the light of a more complete description of the data.
Until a scandal at the 2002 Olympics brought change, figure skating was scored by judges on a scale from 0.0 to 6.0. The scores were often controversial. We have the scores awarded by two judges, Pierre and Elena. How well do they agree? The correlation between their scores is high, but the mean of Pierre’s scores is about 0.8 point lower than Elena’s mean.
These facts in Example 2.21 do not contradict each other. They are simply different kinds of information. The mean scores show that Pierre awards lower scores than Elena. But because Pierre gives every skater a score about 0.8 point lower than Elena, the correlation remains high. Adding the same number to all values of either x or y does not change the correlation. If both judges score the same skaters, the competition is scored consistently because Pierre and Elena agree on which performances are better than others. The high r shows their agreement. But if Pierre scores some skaters and Elena others, we must add 0.8 point to Pierre’s scores to arrive at a fair comparison.
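A short numerical sketch shows how a high correlation and a gap in means can coexist. The scores below are invented for illustration (they are not the data behind Example 2.21), and `correlation` is our own helper that assumes the n − 1 divisor.

```python
from statistics import mean, stdev


def correlation(x, y):
    """Sample correlation: sum of z-score products divided by n - 1."""
    mx, sx = mean(x), stdev(x)
    my, sy = mean(y), stdev(y)
    return sum(((a - mx) / sx) * ((b - my) / sy) for a, b in zip(x, y)) / (len(x) - 1)


# Hypothetical scores from two judges for the same six skaters.
elena = [5.8, 5.5, 5.9, 5.2, 5.6, 5.4]
pierre = [5.0, 4.8, 5.1, 4.3, 4.8, 4.7]

r = correlation(pierre, elena)
mean_gap = mean(elena) - mean(pierre)

# Adding the same constant to every one of Pierre's scores shifts his mean
# but leaves the standardized values, and therefore r, unchanged.
pierre_adjusted = [score + 0.8 for score in pierre]
r_adjusted = correlation(pierre_adjusted, elena)

print(f"r = {r:.3f}, mean gap = {mean_gap:.2f}")
print(f"r after adding 0.8 to Pierre's scores = {r_adjusted:.3f}")
```

Reporting only r would hide the gap between the judges, and reporting only the means would hide their agreement on the ranking. That is why the means and standard deviations belong alongside the correlation.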
The correlation r measures the direction and strength of the linear (straight-line) association between two quantitative variables x and y. It has no unit of measurement.
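Correlation indicates the direction of a linear relationship by its sign: r > 0 for a positive association and r < 0 for a negative association.
Correlation always satisfies −1 ≤ r ≤ 1 and indicates the strength of a linear relationship by how close it is to −1 or 1.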
Correlation ignores the distinction between explanatory and response variables. The value of r is not affected by changes in the unit of measurement of either variable.
While scatterplots give a full graphical summary of the relationship between two variables, it can be hard to see the form of the relationship. The correlation can give a better description of the relationship, but it is only meaningful for linear relationships, and is not resistant to outliers (which can greatly change the value of r).
2.28 What’s wrong? Explain what is wrong with each of the following:
(a) A correlation of 2.0 indicates a very strong positive relationship.
(b) When reporting a correlation, you should always give its units.
(c) The correlation between two quantitative variables is always positive.
(d) Ashley obtains the blood pressure of 10 students from her dorm, and Madison obtains a stress score of 10 students from her dorm. A scatterplot of these two variables will help to explain the correlation between them.
2.29 Interpret some correlations. For each of the following correlations, describe the relationship between the two quantitative variables in terms of the direction and the strength of the linear relationship.
2.30 Blueberries and anthocyanins. In Exercise 2.8 (page 88), you examined the relationship between Antho4 and Antho3, two anthocyanins found in blueberries.
(a) Find the correlation between these two anthocyanins.
(b) Look at the scatterplot for these data that you made in part (a) of Exercise 2.8 (or make one if you did not do that exercise). Is the correlation a good numerical summary of the graphical display in the scatterplot? Explain your answer.
(c) Does the size of the correlation suggest that the amounts of these two anthocyanins are approximately equal in these blueberries? Explain why or why not.
2.31 Blueberries and anthocyanins with logs. In Exercise 2.9 (page 88), you examined the relationship between Antho4 and Antho3, two anthocyanins found in blueberries, using logs for both variables. Answer the questions in the previous exercise for the variables transformed in this way.
2.32 Fuel consumption. In
Exercise 2.11
(page 88),
you used a scatterplot to examine the relationship between
2.33 Fuel consumption for different types of vehicles.
In
Exercise 2.13
(page 89),
you examined the relationship between
2.34 Strong association but no correlation. Here is a data set that illustrates an important point about correlation:
X | 45 | 55 | 65 | 75 | 85 |
Y | 30 | 50 | 70 | 50 | 30 |
(a) Make a scatterplot of Y versus X.
(b) Describe the relationship between Y and X. Is it weak or strong? Is it linear?
(c) Find the correlation between Y and X.
(d) What important point about correlation does this exercise illustrate?
2.35 Bone strength. Exercise 2.14 (page 89) gives the bone strengths of the dominant and the nondominant arms of 15 men who were controls in a study.
(a) Find the correlation between the bone strength of the dominant arm and the bone strength of the nondominant arm.
(b) Look at the scatterplot for these data that you made in part (a) of Exercise 2.14 (or make one if you did not do that exercise). Is the correlation a good numerical summary of the graphical display in the scatterplot? Explain your answer.
2.36 Bone strength for baseball players. Refer to the previous exercise. Similar data for baseball players are given in Exercise 2.15 (page 89). Answer parts (a) and (b) of the previous exercise for these data.
2.37 Student ratings of teachers. A college newspaper interviews a psychologist about student ratings of the teaching of faculty members. The psychologist says, “The evidence indicates that the correlation between the research productivity and teaching rating of faculty members is close to zero.” The paper reports this as “Professor McDaniel said that good researchers tend to be poor teachers, and vice versa.” Explain why the paper’s report is wrong. Write a statement in plain language (without using the word “correlation”) to explain the psychologist’s meaning.
2.38 Decay of a radioactive element. Data for an experiment on the decay of barium-137m are given in Exercise 2.22 (page 90).
(a) Find the correlation between the radioactive counts and the time after the start of the first counting period.
(b) Does the correlation give a good numerical summary of the relationship between these two variables? Explain your answer.
2.39 Decay in the log scale. Refer to the previous exercise and to Exercise 2.23 (page 90), where the counts were transformed by a log.
(a) Find the correlation between the log counts and the time after the start of the first counting period.
(b) Does the correlation give a good numerical summary of the relationship between these two variables? Explain your answer.
(c) Compare your results for this exercise with those from the previous exercise.
2.40 Brand names and generic products.
(a) If a store always prices its generic “store brand” products at 80% of the brand name products’ prices, what would be the correlation between the prices of the brand name products and the store brand products? (Hint: Draw a scatterplot for several prices.)
(b) If the store always prices its generic products $2 less than the corresponding brand name products, then what would be the correlation between the prices of the brand name products and the store brand products?
2.41 Alcohol and calories in beer. Figure 2.12 (page 90) gives a scatterplot of the calories versus percent alcohol for 160 brands of domestic beer.
(a) Compute the correlation for these data.
(b) Does the correlation do a good job of describing the direction and strength of this relationship? Explain your answer.
2.42 Alcohol and calories in beer revisited. Refer to the previous exercise. The data that you used to compute the correlation include an outlier.
(a) Remove the outlier and recompute the correlation.
(b) Write a short paragraph about the possible effects of outliers on a correlation, using this example to illustrate your ideas.
2.43 Compare domestic with imported. In Exercise 2.21 (page 90), you compared domestic beers with imported beers with respect to the relationship between calories and percent alcohol. In that exercise, you used scatterplots to make the comparison. Compute the correlations for these two categories of beer and write a new summary of the comparison, using correlations in addition to the scatterplots.
2.44 Use the applet. Go to the Correlation and Regression applet. Click on the scatterplot to create a group of 10 points in the lower-left corner of the scatterplot with a strong straight-line positive pattern (correlation about 0.9).
(a) Add one point at the upper right that is in line with the first 10. How does the correlation change?
(b) Drag this last point down until it is opposite the group of 10 points. How small can you make the correlation? Can you make the correlation negative? A single outlier can greatly strengthen or weaken a correlation. Always plot your data to check for outlying points.
2.45 Use the applet. You are going to use the Correlation and Regression applet to make different scatterplots with 10 points that have correlation close to 0.9. Many patterns can have the same correlation. Always plot your data before you trust a correlation.
(a) Stop after adding the first two points. What is the value of the correlation? Why does it have this value no matter where the two points are located?
(b) Make a lower-left to upper-right pattern of 10 points with correlation about 0.9.
(c) Make another scatterplot, this time with 9 points in a vertical stack at the left of the plot. Add one point far to the right and move it until the correlation is close to 0.9. Make a rough sketch of your scatterplot.
(d) Make yet another scatterplot, this time with 10 points in a curved pattern that starts at the lower left, rises to the right, then falls again at the far right. Adjust the points up or down until you have a quite smooth curve with correlation close to 0.8. Make a rough sketch of this scatterplot also.
2.46 An interesting set of data. Make a scatterplot of the following data:
x | 1 | 2 | 3 | 4 | 10 | 10 |
y | 1 | 3 | 3 | 5 | 1 | 11 |
Verify that the correlation is about 0.5. What feature of the data is responsible for reducing the correlation to this value despite a strong straight-line association between x and y in most of the observations?
2.47 Internet use and babies. Figure 2.13 (page 90) is a scatterplot of the number of births per 1000 people versus Internet users per 100 people for 106 countries. In Exercise 2.24 (page 90), you described this relationship.
(a) Make a plot of the data similar to Figure 2.13 and report the correlation.
(b) Is the correlation a good numerical summary for this relationship? Explain your answer.
2.48 What’s wrong? Explain what is wrong with each of the following:
There is a high correlation between the age of American workers and their occupation.
We found a high correlation
The correlation between the sex of a group of students and
the color of their cell phone was