2.4 Least-Squares Regression

Correlation measures the direction and strength of the linear (straight-line) relationship between two quantitative variables. If a scatterplot shows a linear relationship, we summarize this overall pattern by drawing a line on the scatterplot. A regression line summarizes the relationship between two variables, but only in a specific setting: when one of the variables helps explain or predict the other. That is, regression describes a relationship between a response variable and an explanatory variable.

Example 2.22 World Economic Forum.

The World Economic Forum studies data on many variables related to financial development in the countries of the world. In 2017, this organization introduced a new metric, the Inclusive Development Index (IDI), which measures the impact of a country’s economic policy and growth on all its citizens.¹⁷ One of the variables used in the metric is median per capita daily income (MI), measured in U.S. dollars. Here are the data for 15 countries that ranked high on the IDI:

Country	IDI	MI	Country	IDI	MI	Country	IDI	MI
Australia	5.36	44.4	Iceland	6.07	43.4	Korea Rep	5.09	34.2
Belgium	5.14	43.8	Ireland	5.44	38.0	Netherlands	5.61	43.3
Canada	5.06	49.2	Israel	4.51	25.8	Norway	6.08	63.8
Czech Republic	5.09	24.3	Italy	4.31	34.3	Portugal	3.97	21.2
Estonia	4.74	22.1	Japan	4.53	34.8	United Kingdom	4.89	39.4

How well does MI predict the IDI? Figure 2.16 is a scatterplot of the data. The correlation is r=0.74. The scatterplot includes a regression line drawn through the points.

A scatterplot of Inclusive Development Index data. — Figure 2.16 Scatterplot of the Inclusive Development Index (IDI) and median per capita daily income for 15 countries that rank high on financial development, Example 2.22.
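As a check on the correlation quoted above, here is a minimal sketch that computes r directly from the 15 pairs in the table. It uses only the Python standard library; the list names `mi` and `idi` and the helper `correlation` are illustrative choices, not part of the text.

```python
from math import sqrt
from statistics import mean

# MI (U.S. dollars) and IDI for the 15 countries of Example 2.22, in the order
# Australia, Belgium, Canada, Czech Republic, Estonia, Iceland, Ireland, Israel,
# Italy, Japan, Korea Rep, Netherlands, Norway, Portugal, United Kingdom.
mi = [44.4, 43.8, 49.2, 24.3, 22.1, 43.4, 38.0, 25.8, 34.3, 34.8,
      34.2, 43.3, 63.8, 21.2, 39.4]
idi = [5.36, 5.14, 5.06, 5.09, 4.74, 6.07, 5.44, 4.51, 4.31, 4.53,
       5.09, 5.61, 6.08, 3.97, 4.89]

def correlation(x, y):
    """Pearson correlation r of two equal-length sequences."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = correlation(mi, idi)
print(f"r = {r:.2f}")  # r = 0.74, the value quoted in the text
```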

Fitting a line to data

When a scatterplot displays a linear pattern, we can describe the overall pattern by drawing a regression line through the points. Of course, no straight line passes exactly through all the points. Fitting a line to data means drawing a line that comes as close as possible to the points. The equation of a line fitted to the data gives a concise description of the relationship between the response variable y and the explanatory variable x. It is the numerical summary that supports the scatterplot—our graphical summary.

In practice, we will use software to obtain values of b0 and b1 for a given set of data.

Example 2.23 Regression line for IDI.

Any straight line describing the relationship between IDI and MI has the form

IDI=b0+(b1×MI)

In Figure 2.16 we have drawn the regression line with the equation

IDI=3.609+(0.03872×MI)

To make the plot, we chose two values of MI within the range of the data and evaluated the value of IDI using the regression equation. In this case, for example, we chose MI =20 and MI =60, and the corresponding values of IDI are

IDI=3.609+(0.03872×20)=4.38

and

IDI=3.609+(0.03872×60)=5.93

We plotted these two points, (x,y)=(20, 4.38) and (x,y)=(60, 5.93), on our scatterplot and connected them with a straight line. As a check on this type of work, it is a good idea to compute a third point and verify that it is on your straight line.
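The two-point construction and the third-point check can be sketched in a few lines of Python. The function name `predicted_idi` and the choice of MI = 40 for the check point are assumptions made for illustration; the coefficients are the ones given in this example.

```python
from math import isclose

# Coefficients of the line drawn in Figure 2.16 (Example 2.23).
B0, B1 = 3.609, 0.03872

def predicted_idi(mi):
    """Value of IDI on the regression line for a given MI (U.S. dollars)."""
    return B0 + B1 * mi

# Two points are enough to draw the line.
p1 = (20, predicted_idi(20))
p2 = (60, predicted_idi(60))
# A third point (MI = 40 is an arbitrary choice) serves as a check.
p3 = (40, predicted_idi(40))

# All three points lie on one line if the slope from p1 to p3
# matches the slope from p1 to p2.
slope_12 = (p2[1] - p1[1]) / (p2[0] - p1[0])
slope_13 = (p3[1] - p1[1]) / (p3[0] - p1[0])
on_line = isclose(slope_12, slope_13)

for mi, idi in (p1, p2, p3):
    print(f"MI = {mi}: IDI = {idi:.2f}")
print("third point on the line:", on_line)
```

When the points are plotted by hand, the same check amounts to laying a ruler through the first two points and confirming that the third falls on it.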

Figure 2.16 shows that the regression line fits the data reasonably well. The slope b1=0.03872 tells us that IDI goes up by 0.03872 unit for each added U.S. dollar of MI.

Check-in

2.13 A regression line. A regression equation is y=20+30x.
1. What is the slope of the regression line?
2. What is the intercept of the regression line?
3. Use the regression equation to find the values of y for x=0, x=30, and x=60.
4. Plot the regression line for values of x between 0 and 60.
2.14 Plot a line. Make a plot of the data in Example 2.22 and plot the line

IDI=4.609+(0.03872×MI)

on your sketch. Explain why this line does not give a good fit to the data.

Prediction

We can use a regression line to make a prediction of the response y for a specific value of the explanatory variable x. We can interpret the prediction as the average value of y corresponding to a collection of cases at the particular value of x or as our best guess of the value of y for a case with the particular value of x.

Example 2.24 Visualize the prediction for IDI.

Based on the linear pattern, we want to predict IDI for a country whose MI is $40. To use the fitted line to predict IDI, go “up and over” on the graph in Figure 2.17. From x=40 on the x axis, go up to the fitted line and over to the y axis. The graph shows that the predicted IDI is slightly larger than 5.

If we have the equation of the line, it is faster and more accurate to substitute x=40 in the equation. The predicted IDI is

IDI=3.609+(0.03872×40)=5.16

The degree of uncertainty of predictions from a regression line depends on how much scatter about the line the data show. In Figure 2.17, IDI for values of MI around 40 show a spread of 0.5 unit.

Check-in

2.15 Predict the IDI. Use the regression equation in Example 2.23 to predict the IDI for a country whose MI is $35.

The least-squares regression line

Different people might draw different lines by eye on a scatterplot. This is especially true when the points are widely scattered. We need a way to draw a regression line that doesn’t depend on our guess as to where the line should go. We will use the least-squares idea to select the best regression line. To get started, we’ll think about the distance between an observed value of the response variable (a data point on a scatterplot) and the corresponding value for that case on the regression line predicted by an equation.

Let’s look at the World Economic Forum data for Canada. From Example 2.22 (page 99), we see that IDI =5.06 and MI =49.2 for this case. Using the equation from Example 2.23, we calculate the predicted value of IDI:

predicted IDI=3.609+(0.03872×MI)=3.609+(0.03872×49.2)=5.51

For Canada, the distance between the observed value of IDI and the predicted value of IDI is

distance=observed IDI − predicted IDI=5.06−5.51=−0.45

Note that this distance is negative. Distances are negative if the observed response lies below the line and positive if the response lies above the line.
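The distance calculation for Canada can be written as a small function. This is only a sketch: the name `distance` is an illustrative choice, and the coefficients are those of the regression line in Example 2.23.

```python
# Coefficients of the regression line from Example 2.23.
B0, B1 = 3.609, 0.03872

def distance(observed_idi, mi):
    """Observed IDI minus the IDI predicted by the line at this MI."""
    predicted = B0 + B1 * mi
    return observed_idi - predicted

# Canada: IDI = 5.06 and MI = 49.2 (Example 2.22).
canada = distance(observed_idi=5.06, mi=49.2)
print(f"distance for Canada = {canada:.2f}")  # -0.45: the point lies below the line
```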

Check-in

2.16 Find a distance. Use the regression line in Example 2.23 to estimate the IDI for Italy. What is the distance between the observed IDI for Italy and this predicted value?
2.17 Positive and negative distances. Examine Figure 2.16 carefully. How many of the distances are positive? How many are negative?

We are now ready to describe the least-squares idea and the least-squares line. No line will pass exactly through all the points. Even so, we want the distances of the data points from the line to be as small as possible.

Example 2.25 The least-squares idea.

Figure 2.18 illustrates the idea. This plot shows some of the data, along with a regression line. The distances of the data points from the line appear as vertical dashed line segments.

Figure 2.18. The plot shows Inclusive Development Index (vertical axis) against median per capita daily income in U.S. dollars (horizontal axis) for four of the data points, at approximately (34.1, 5.1), (34.3, 4.3), (34.7, 4.5), and (38.1, 5.5), together with the least-squares regression line, which rises from about (33.5, 4.9) to (38.5, 5.1). Two points lie above the line and two lie below. A vertical segment connects each point to the line. One point is highlighted: the data point is the observed y, the point where the vertical segment meets the line is the predicted y, and the segment itself is the distance between the two. The least-squares regression line minimizes the sum of the squares of the vertical distances from all the data points to the line.
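To make the least-squares idea concrete, the sketch below computes the sum of squared vertical distances for the line of Example 2.23 and for two other candidate lines. The two competing lines are hypothetical, chosen only for comparison, and the function name `sum_squared_distances` is an illustrative choice.

```python
# MI and IDI for the 15 countries of Example 2.22, in table order.
mi = [44.4, 43.8, 49.2, 24.3, 22.1, 43.4, 38.0, 25.8, 34.3, 34.8,
      34.2, 43.3, 63.8, 21.2, 39.4]
idi = [5.36, 5.14, 5.06, 5.09, 4.74, 6.07, 5.44, 4.51, 4.31, 4.53,
       5.09, 5.61, 6.08, 3.97, 4.89]

def sum_squared_distances(b0, b1, x, y):
    """Sum of squared vertical distances (observed y minus predicted y)."""
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))

# The line from Example 2.23.
sse_fitted = sum_squared_distances(3.609, 0.03872, mi, idi)

# Two hypothetical lines that someone might draw by eye.
sse_steeper = sum_squared_distances(3.4, 0.045, mi, idi)
sse_flatter = sum_squared_distances(3.9, 0.030, mi, idi)

print(f"line of Example 2.23: {sse_fitted:.3f}")
print(f"steeper line:         {sse_steeper:.3f}")
print(f"flatter line:         {sse_flatter:.3f}")
```

Any other choice of intercept and slope can be tried in the same way; no line gives a smaller sum than the least-squares line.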

One reason for the popularity of the least-squares regression line is that the problem of finding the line has a simple solution. We can give the recipe for the least-squares line in terms of the means and standard deviations of the two variables and their correlation.

Equation of the least-squares regression line

We have data on an explanatory variable x and a response variable y for n individuals. The means and standard deviations of the sample data are x¯ and sx for x and y¯ and sy for y, and the correlation between x and y is r. The equation of the least-squares regression line of y on x is

y^=b0+b1x

with slope

b1=r(sy/sx)

and intercept

b0=y¯−b1x¯

b0 and b1 are the regression coefficients of the least-squares equation.

We write y^ (read “y hat”) in the equation of the regression line to emphasize that the line gives a predicted response y^ for any x. In some applications, we use the equation to predict y for values of x that may or may not be in our original set of data. This use of regression is sometimes called predictive analytics or simply analytics. Because of the scatter of points about the line, the predicted response will usually not be exactly the same as the actually observed response y.

Example 2.26 The equation for predicting IDI.

The line in Figure 2.16 is, in fact, the least-squares regression line for predicting the IDI from the MI. Software gives the equation of this line as

y^=3.609+0.03872x

It is estimated using data from 15 countries that ranked high on the IDI.

Example 2.27 Check the calculation of the line.

We can check the calculation of the least-squares regression line given in Example 2.26 by using the means and standard deviations for x (MI) and y (IDI) as well as the correlation between these two variables. Here are the pieces we need:

x¯=37.466667, y¯=5.0593333

sx=11.434826, sy=0.6018361

and

r=0.7357

The slope is

b1=r(sy/sx)=0.7357×(0.6018361/11.434826)=0.038721

and the intercept is

b0=y¯−b1x¯=5.0593333−0.038721×37.466667=3.60862

The equation of the least-squares line is indeed

y^=3.609+0.03872x
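The same check can be scripted from the summary statistics. This is a minimal sketch using the values quoted in this example; the variable names are illustrative. The last digits of the intercept depend on how many decimal places are carried for r and the slope, so the output is rounded to the precision used in the equation.

```python
# Summary statistics quoted in Example 2.27.
x_bar, y_bar = 37.466667, 5.0593333   # means of MI and IDI
s_x, s_y = 11.434826, 0.6018361       # standard deviations of MI and IDI
r = 0.7357                            # correlation between MI and IDI

b1 = r * s_y / s_x       # slope: b1 = r * sy / sx
b0 = y_bar - b1 * x_bar  # intercept: b0 = y-bar minus b1 times x-bar

print(f"slope     b1 = {b1:.5f}")
print(f"intercept b0 = {b0:.3f}")
```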

When doing calculations like this by hand, you may need to carry extra decimal places in the preliminary calculations to get accurate values of the slope and intercept. In practice, you don’t need to calculate the means, standard deviations, and correlation first. Statistical software or your calculator will give the slope b1 and intercept b0 of the least-squares line from keyed-in values of the variables x and y. You can then concentrate on understanding and using the regression line. Be careful, though: different software packages and calculators label the slope and intercept differently in their output, so remember that the slope is the value that multiplies x in the equation.

Example 2.28 Regression using software.

Figure 2.19 displays regression output for the IDI data from three statistical software packages. You can find the slope and intercept of the least-squares line, calculated to more decimal places than we need, in each output. The software also provides information that we do not yet need, including some that we did not include in Figure 2.19.

Minitab, Excel, and JMP outputs of a regression analysis. — Figure 2.19 Selected least-squares regression outputs for the world financial markets data, Example 2.28. Other software produces similar output.

Minitab output. A fitted line plot of IDI (vertical axis, 3.5 to 6.5) versus MI (horizontal axis, 20 to 70) shows the data points in a loose linear cluster rising from left to right, with the line IDI = 3.609 + 0.03872 MI; S = 0.423019, R-Sq = 54.1%, R-Sq(adj) = 50.6%. Regression analysis, IDI versus MI: the regression equation is IDI = 3.609 + 0.03872 MI. Model summary: S = 0.423019, R-Sq = 54.12%, R-Sq(adj) = 50.60%. Analysis of variance:

Source	DF	SS	MS	F	P
Regression	1	2.74461	2.74461	15.34	0.002
Error	13	2.32628	0.17894
Total	14	5.07089

Excel output. An MI line fit plot of IDI (vertical axis, 0 to 8) versus MI (horizontal axis, 0.0 to 80.0) shows the observed IDI values and a second set of points, the predicted IDI values, which rise in a straight line through the center of the cluster. Regression statistics: Multiple R, 0.73569553; R Square, 0.541247913; Adjusted R Square, 0.505959291; Standard Error, 0.423018721; Observations, 15. Coefficients: Intercept, 3.608585784; MI, 0.03872102.

JMP output. Under the heading Bivariate Fit of IDI By MI, a scatterplot of IDI (vertical axis, 3.5 to 6.5) versus MI (horizontal axis, 20 to 65) shows the data with a linear fit line. Linear fit: IDI = 3.6085858 + 0.038721 × MI. Summary of fit: RSquare, 0.541248; RSquare Adj, 0.505959; Root Mean Square Error, 0.423019; Mean of Response, 5.059333; Observations (or Sum Wgts), 15. The analysis of variance menu is collapsed. Parameter estimates:

Term	Estimate	Std Error	t Ratio	Prob>|t|
Intercept	3.6085858	0.386201	9.34	<0.001*
MI	0.038721	0.009887	3.92	0.0018*

Part of the art of using software is to ignore the extra information that is almost always present. Look for the results that you need. Once you understand a statistical method, you can read output from almost any software.

Check-in

2.18 Predicted values for MI and IDI. Refer to the World Economic Forum data in Example 2.22.
1. Use software to compute the coefficients of the regression equation. Indicate where to find the slope and the intercept on the output and report these values.
2. Make a scatterplot of the data with the least-squares line.
3. For Japan, the IDI is 4.53 and the MI is $34.8. Find the predicted value of IDI for Japan.
4. Find the difference between the actual value and the predicted value for Japan.

Facts about least-squares regression

The use of regression to describe the relationship between a response variable and an explanatory variable is one of the most commonly encountered statistical methods, and least-squares is the most commonly used technique for fitting a regression line to data. Here are some facts about least-squares regression lines.

Fact 1. There is a close connection between correlation and the slope of the least-squares line. The slope is

b1=r(sy/sx)

This equation says that along the regression line, a change of 1 standard deviation in x corresponds to a change of r standard deviations in y. When the variables are perfectly correlated (r=1 or r=−1), the change in the predicted response y^ is the same (in standard deviation units) as the change in x. Otherwise, because −1≤r≤1, the change in y^ is less than the change in x. As the correlation grows less strong, the prediction y^ moves less in response to changes in x. Note that if the correlation is zero, then the slope of the least-squares regression line will be zero.
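Fact 1 can be checked numerically: if both variables are converted to standardized units, the least-squares slope should equal r. The sketch below does this for the World Economic Forum data; the helper names `standardize`, `least_squares_slope`, and `correlation` are illustrative.

```python
from math import isclose, sqrt
from statistics import mean, stdev

# MI and IDI for the 15 countries of Example 2.22, in table order.
mi = [44.4, 43.8, 49.2, 24.3, 22.1, 43.4, 38.0, 25.8, 34.3, 34.8,
      34.2, 43.3, 63.8, 21.2, 39.4]
idi = [5.36, 5.14, 5.06, 5.09, 4.74, 6.07, 5.44, 4.51, 4.31, 4.53,
       5.09, 5.61, 6.08, 3.97, 4.89]

def standardize(values):
    """Convert values to standardized units: (value - mean) / std. dev."""
    m, s = mean(values), stdev(values)
    return [(v - m) / s for v in values]

def least_squares_slope(x, y):
    """Slope of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    return sxy / sxx

def correlation(x, y):
    """Pearson correlation r."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    syy = sum((b - y_bar) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

r = correlation(mi, idi)
b1 = least_squares_slope(mi, idi)
slope_standardized = least_squares_slope(standardize(mi), standardize(idi))

print(f"r                          = {r:.4f}")
print(f"slope in standardized units = {slope_standardized:.4f}")
print(f"slope in original units     = {b1:.5f}")
print(f"r * sy / sx                 = {r * stdev(idi) / stdev(mi):.5f}")
```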

Fact 2. The least-squares regression line always passes through the point (x¯,y¯) on the graph of y against x. This means the least-squares regression line of y on x is the line with slope rsy/sx that passes through the point (x¯,y¯). We can describe regression entirely in terms of the basic descriptive measures x¯, sx, y¯, sy, and r.

Fact 3. The distinction between explanatory and response variables is essential in regression. Least-squares regression looks at the distances of the data points from the line only in the y direction. If we reverse the roles of the two variables, we get a different least-squares regression line.

Example 2.29 World Economic Forum.

Figure 2.20 is a scatterplot of the World Economic Forum data described in Example 2.22 (page 99). There is a moderate positive linear relationship between Inclusive Development Index (IDI) and median per capita income (MI).

The two lines on the plot are the two least-squares regression lines. The regression line for using MI to predict IDI is solid, while the regression line for using IDI to predict MI is dashed. The two regressions give different lines. In the regression setting, you must choose one variable to be the explanatory variable.
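Facts 2 and 3 can both be illustrated with a short calculation. The sketch below fits the two regressions, confirms that each line passes through the point of means, and re-expresses the second line on the same axes as the first so that the slopes can be compared. The function name `least_squares` is an illustrative choice.

```python
from math import isclose
from statistics import mean

# MI and IDI for the 15 countries of Example 2.22, in table order.
mi = [44.4, 43.8, 49.2, 24.3, 22.1, 43.4, 38.0, 25.8, 34.3, 34.8,
      34.2, 43.3, 63.8, 21.2, 39.4]
idi = [5.36, 5.14, 5.06, 5.09, 4.74, 6.07, 5.44, 4.51, 4.31, 4.53,
       5.09, 5.61, 6.08, 3.97, 4.89]

def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line of y on x."""
    x_bar, y_bar = mean(x), mean(y)
    sxy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    sxx = sum((a - x_bar) ** 2 for a in x)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

b0_idi, b1_idi = least_squares(mi, idi)  # MI explains IDI
b0_mi, b1_mi = least_squares(idi, mi)    # IDI explains MI

# Fact 2: each line passes through the point of means.
through_means_1 = isclose(b0_idi + b1_idi * mean(mi), mean(idi))
through_means_2 = isclose(b0_mi + b1_mi * mean(idi), mean(mi))

# Fact 3: drawn with IDI on the vertical axis, the second line
# MI = b0_mi + b1_mi * IDI has slope 1 / b1_mi, not b1_idi.
slope_on_same_axes = 1 / b1_mi

print(f"IDI on MI: IDI = {b0_idi:.3f} + {b1_idi:.5f} * MI")
print(f"MI on IDI: MI  = {b0_mi:.2f} + {b1_mi:.3f} * IDI")
print(f"slopes with IDI on the vertical axis: {b1_idi:.5f} and {slope_on_same_axes:.5f}")
```

The two lines coincide only when the correlation is perfect; the product of the two fitted slopes equals the square of the correlation.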

Correlation and regression

Even though the correlation r ignores the distinction between explanatory and response variables, there is a close connection between correlation and regression. We saw that the slope of the least-squares line involves r. Another connection between correlation and regression is even more important. In fact, the numerical value of r as a measure of the strength of a linear relationship is best interpreted by thinking about regression.

Example 2.30 Using r².

The correlation between the MI and the IDI in Example 2.22 (page 99) is r=0.73570, so r²=0.54125. In other words, median per capita daily income explains about 54% of the variability in the Inclusive Development Index.

When you see a correlation, square it to get a better feel for the strength of the association. Perfect correlation (r=−1 or r=1) means that the points lie exactly on a line. Then r²=1 and all the variation in one variable is accounted for by the linear relationship with the other variable. If r=−0.7 or r=0.7, r²=0.49 and about half the variation is accounted for by the linear relationship. In the r² scale, correlation ±0.7 is about halfway between 0 and ±1. All three software outputs in Figure 2.19 include r², either in decimal form or as a percent.

Check-in

2.19 What fraction of the variation is explained? Consider the following correlations: −1.0, −0.8, −0.4, −0.2, 0, 0.2, 0.4, and 0.8. For each, give the fraction of the variation in y that is explained by the least-squares regression of y on x. Summarize what you have found from performing these calculations.

Interpretation of r²

Here is an explanation of why the square of the correlation r describes the variation explained by the least-squares regression. Think about trying to predict a new value of y. With no other information than our sample of values of y, a reasonable choice is y¯.

Now consider how your prediction would change if you had an explanatory variable. If we use the regression equation for the prediction, we use the equation y^=b0+b1x. This prediction takes into account the value of the explanatory variable x.

Let’s compare our two choices for predicting y. With the explanatory variable x, we use y^; without this information, we use y¯, the sample mean of the response variable. How can we compare these two choices? When we use y¯ to make a prediction, the difference between our observation and the prediction is y−y¯. If, instead, we use y^, the distance is y−y^. The use of x in our prediction changes our distance from y−y¯ to y−y^; thus the difference is y^−y¯. Our comparison uses the sums of squares of these differences, ∑(y−y¯)² and ∑(y^−y¯)². The ratio of these two quantities turns out to be (after some mathematical manipulations) the square of the correlation:

r²=∑(y^−y¯)² / ∑(y−y¯)²

The numerator represents the variation in y that is explained by x, and the denominator represents the total variation in y. In Chapter 11, where y^ is a combination of several explanatory variables, we use R² to denote this quantity.
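The ratio can be verified numerically for the World Economic Forum data. The sketch below fits the least-squares line, forms the two sums of squares, and compares their ratio with the square of the correlation. The variable names are illustrative; the two sums should match the regression and total sums of squares that appear in an analysis of variance table for these data.

```python
from math import sqrt
from statistics import mean

# MI and IDI for the 15 countries of Example 2.22, in table order.
mi = [44.4, 43.8, 49.2, 24.3, 22.1, 43.4, 38.0, 25.8, 34.3, 34.8,
      34.2, 43.3, 63.8, 21.2, 39.4]
idi = [5.36, 5.14, 5.06, 5.09, 4.74, 6.07, 5.44, 4.51, 4.31, 4.53,
       5.09, 5.61, 6.08, 3.97, 4.89]

x_bar, y_bar = mean(mi), mean(idi)
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(mi, idi))
sxx = sum((x - x_bar) ** 2 for x in mi)

# Least-squares line and its predictions.
b1 = sxy / sxx
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in mi]

# Variation explained by x, and total variation in y.
explained = sum((yh - y_bar) ** 2 for yh in y_hat)
total = sum((y - y_bar) ** 2 for y in idi)
r_squared = explained / total

# Square of the correlation, computed independently.
r = sxy / sqrt(sxx * total)

print(f"explained variation = {explained:.5f}")
print(f"total variation     = {total:.5f}")
print(f"ratio               = {r_squared:.5f}")
print(f"r squared           = {r * r:.5f}")
```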

Section 2.4 SUMMARY

A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes.
The most common method of fitting a line to a scatterplot is least squares. The least-squares regression line is the straight line y^=b0+b1x that minimizes the sum of the squares of the vertical distances of the observed y-values from the line.
You can use a regression line to predict the value of y for any value of x by substituting this x into the equation of the line.
The slope b1 of a regression line y^=b0+b1x is the rate at which the predicted response y^ changes along the line as the explanatory variable x changes. Specifically, b1 is the change in y^ when x increases by 1. The numerical value of the slope depends on the units used to measure x and y.
The intercept b0 of a regression line y^=b0+b1x is the predicted response y^ when the explanatory variable x=0.
The least-squares regression line of y on x is the line with slope b1=rsy/sx and intercept b0=y¯−b1x¯. This line always passes through the point (x¯,y¯).
Correlation and regression are closely connected. The correlation r is the slope of the least-squares regression line when we measure both x and y in standardized units. The square of the correlation r² is the fraction of the variation of one variable that is explained by least-squares regression on the other variable.

Now that you have completed this section, you will be able to:

Find the equation of the least-squares regression line and draw it on a scatterplot of a set of data. Review Examples 2.23 (page 100), 2.26, and 2.27 (page 103) and try Exercise 2.49.
Predict a value of the response variable y for a given value of the explanatory variable x using a regression equation. Review Example 2.24 (page 101) and try Exercise 2.49.
Read the output of statistical software to find the equation of the least-squares regression line and the value of r². Review Example 2.28 (page 104) and try Exercise 2.63.
Explain the meaning of r² in the regression setting. Review Example 2.30 (page 107) and try Exercise 2.63.

Section 2.4 EXERCISES

2.49 Blueberries and anthocyanins. In Exercise 2.8 (page 88), you examined the relationship between Antho3 and Antho4, two anthocyanins found in blueberries. In Exercise 2.30 (page 96), you found the correlation between these two variables.
1. Find the equation of the least-squares regression line for predicting Antho3 from Antho4.
2. Make a scatterplot of the data with the fitted line.
3. How well does the line fit the data? Explain your answer.
4. Use the line to predict the value of Antho3 when Antho4 is equal to 1.7.
2.50 Fuel consumption. In Exercise 2.11 (page 88), you examined the relationship between CO2 emissions and highway fuel consumption for 502 vehicles that use regular fuel. In Exercise 2.32 (page 96), you found the correlation between these two variables.
1. Find the equation of the least-squares regression line for predicting CO2 emissions from highway fuel consumption.
2. Make a scatterplot of the data with the fitted line.
3. How well does the line fit the data? Explain your answer.
4. Use the line to predict the value of CO2 for vehicles that consume 8.0 liters per 100 kilometers (L/100 km).
2.51 Fuel consumption for different types of vehicles. In Exercise 2.13 (page 89), you examined the relationship between CO2 emissions and highway fuel consumption for 1045 vehicles. You used different plotting symbols for the four different types of fuel used by these vehicles: regular, premium, diesel, and ethanol.
1. Find the least-squares regression equation for predicting CO2 emissions from highway fuel consumption for all 1045 vehicles.
2. Make a scatterplot of the data with the fitted line.
3. Based on what you learned from Exercise 2.13, do you think that a single least-squares regression line provides a good fit for all four types of vehicles? Explain your answer.
2.52 Bone strength. Exercise 2.14 (page 89) gives the bone strengths of the dominant and the nondominant arms for 15 men who were controls in a study.
1. Plot the data. Use the bone strength in the nondominant arm as the explanatory variable and bone strength in the dominant arm as the response variable.
2. The least-squares regression line for these data is
  
  dominant =2.74+(0.936× nondominant)
  
  Add this line to your plot.
3. Use the scatterplot (a graphical summary) with the least-squares line (a graphical display of a numerical summary) to write a short paragraph describing this relationship.
2.53 Bone strength for baseball players. Refer to the previous exercise. Similar data for baseball players are given in Exercise 2.15 (page 89). Here is the equation of the least-squares line for the baseball players:

dominant =0.886+(1.373× nondominant)

Answer parts (a) and (c) of the previous exercise for these data.
2.54 Predict the bone strength. Refer to Exercise 2.52. A young male who is not a baseball player has a bone strength of 16.0 Nm/1000 in his nondominant arm. Predict the bone strength in the dominant arm for this person.
2.55 Predict the bone strength for a baseball player. Refer to Exercise 2.53. A young male who is a baseball player has a bone strength of 16.0 Nm/1000 in his nondominant arm. Predict the bone strength in the dominant arm for this person.
2.56 Compare the predictions. Refer to the two previous exercises. You have predicted two dominant-arm bone strengths: one for a baseball player and one for a person who is not a baseball player. The nondominant bone strengths are both 16.0 Nm/1000.
1. Compare the two predictions by computing the difference in means, baseball player minus control.
2. Explain how the difference in the two predictions is an estimate of the effect of baseball throwing exercise on the strength of arm bones.
3. For nondominant arm strengths of 12 Nm/1000 and 20 Nm/1000, repeat your predictions and take the differences. Make a table of the results of all three calculations (for 12, 16, and 20 Nm/1000).
4. Write a short summary of the results of your calculations for the three different nondominant-arm strengths. Be sure to include an explanation of why the differences are not the same for the three nondominant-arm strengths.

2.57 Least-squares regression for radioactive decay. Refer to Exercise 2.22 (page 90) for the data on radioactive decay of barium-137m. Here are the data:

Time	1	3	5	7
Count	578	317	203	118

1. Using the least-squares regression equation
  
  count =602.8−(74.7× time)
  
  find the predicted values for the counts.
2. Compute the differences, observed count minus predicted count. How many of these are positive? How many are negative?
3. Square and sum the differences that you found in part (b).
4. Repeat the calculations that you performed in parts (a), (b), and (c) using the equation
  
  count =500−(100× time)
  
5. In a short paragraph, explain the least-squares idea using the calculations that you performed in this exercise.

2.58 Least-squares regression for the log counts. Refer to Exercise 2.23 (page 90), where you analyzed the radioactive decay of barium-137m data using log counts. Here are the data:

Time	1	3	5	7
Log count	6.35957	5.75890	5.31321	4.77068

1. Using the least-squares regression equation
  
  log count =6.593−(0.2606× time)
  
  find the predicted values for the log counts.
2. Compute the differences, observed log count minus predicted log count. How many of these are positive? How many are negative?
3. Square and sum the differences that you found in part (b).
4. Repeat the calculations that you performed in parts (a) to (c) using the equation
  
  log count =7−(0.2× time)
  
5. In a short paragraph, explain the least-squares idea using the calculations that you performed in this exercise.

2.59 College students by state. How well does the population of a state predict the number of undergraduates? The National Center for Education Statistics collects data for each of the 50 U.S. states that we can use to address this question.¹⁸
1. Make a scatterplot with population on the x axis and number of undergraduates on the y axis.
2. Describe the form, direction, and strength of the relationship. Are there any outliers?
3. For the number of undergraduates, the mean is 302,136 and the standard deviation is 358,460, and for population, the mean is 5,955,551 and the standard deviation is 6,620,733. The correlation between the number of undergraduates and the population is 0.98367. Use this information to find the least-squares regression line. Show your work.
4. Add the least-squares line to your scatterplot.
2.60 College students by state without the four largest states. Refer to the previous exercise. Let’s eliminate the four largest states, which have populations greater than 15 million. Here are the numerical summaries: for number of undergraduate college students, the mean is 220,134 and the standard deviation is 165,270; for population, the mean is 4,367,448 and the standard deviation is 3,310,957. The correlation between the number of undergraduate college students and the population is 0.97081. Use this information to find the least-squares regression line. Show your work.
2.61 Make predictions and compare. Refer to the two previous exercises. Consider a state with a population of 4 million. (This value is approximately the median population for the 50 states.)
1. Using the least-squares regression equation for all 50 states, find the predicted number of undergraduate college students.
2. Do the same using the least-squares regression equation for the 46 states with populations less than 15 million.
3. Compare the predictions that you made in parts (a) and (b). Write a short summary of your results and conclusions. Pay particular attention to the effect of including the four states with the largest populations in the prediction equation for a median-sized state.
2.62 College students by state. Refer to Exercise 2.59, where you examined the relationship between the number of undergraduate college students and the populations for the 50 states. Figure 2.21 gives the output from a software package for the regression. Use this output to answer the following questions.
1. What is the equation of the least-squares regression line?
2. What is the value of r²?
3. Interpret the value of r².
4. Does the software output tell you that the relationship is linear and not, for example, curved? Explain your answer.
Figure 2.21 SPSS output for predicting number of undergraduate college students, using the population for the 50 U.S. states, Exercise 2.62.

The output window shows two tables with the following data. First table. Model summary. Model, 1. R, 0.984 asterisk. R square, 0.968. Adjusted R square, 0.967. Standard error of the estimate, 65178.746. a, predictors, constant, population. Second table. Coefficients. Model, 1, constant. Unstandardized coefficients B, negative 15044.917. Unstandardized coefficients standard error, 12454.662. Standardized coefficients beta, blank. t, negative 1.208. Significance, 0.233. Model, population. Unstandardized coefficients B, 0.053. Unstandardized coefficients standard error, 0.001. Standardized coefficients beta, 0.984. t, 37.869. Significance, 0.000. a, dependent variable, undergrads.
2.63 College students by state without the four largest states. Refer to Exercise 2.60, where you eliminated the four largest states that have populations greater than 15 million. Figure 2.22 gives software output for these data. Answer the questions in the previous exercise for the data set with the 46 states.

Figure 2.22 SPSS output for predicting number of undergraduate college students using population, with the four largest states deleted, Exercise 2.63.

The output window shows two tables with the following data. First table. Model summary. Model, 1. R, 0.971 asterisk. R square, 0.942. Adjusted R square, 0.941. Standard error of the estimate, 40085.795. a, predictors, constant, population. Second table. Coefficients. Model, 1, constant. Unstandardized coefficients B, negative 8491.907. Unstandardized coefficients standard error, 9852.117. Standardized coefficients beta, blank. t, negative 0.862. Significance, 0.393. Model, population. Unstandardized coefficients B, 0.048. Unstandardized coefficients standard error, 0.002. Standardized coefficients beta, 0.971. t, 26.850. Significance, 0.000. a, dependent variable, undergrads.

2.64 Data generated by software. The following 20 observations on Y and X were generated by a computer program:

X	Y	X	Y
23.07	35.49	18.85	28.17
19.88	30.38	19.96	31.17
18.83	26.13	17.87	27.74
22.09	31.85	20.20	30.01
17.19	26.77	20.65	29.61
20.72	29.00	20.32	31.78
18.10	28.92	21.37	32.93
18.01	26.30	17.31	30.29
18.69	29.49	23.50	28.57
18.05	31.36	22.02	29.80

1. Make a scatterplot and describe the relationship between Y and X.
2. Find the equation of the least-squares regression line and add the line to your plot.
3. What percent of the variability in Y is explained by X?
4. Summarize your analysis of these data in a short paragraph.

2.65 Add an outlier. Refer to Exercise 2.64. Add an additional observation with y=25 and x=35 to the data set. Repeat the analysis that you performed in Exercise 2.64 and summarize your results, paying particular attention to the effect of this outlier.
2.66 Add a different outlier. Refer to the previous two exercises. Add an additional observation with y=36 and x=30 to the original data set.
1. Repeat the analysis that you performed in Exercise 2.64 and summarize your results, paying particular attention to the effect of this outlier.
2. In this exercise and in the previous one, you added an outlier to the original data set and reanalyzed the data. Write a short summary of the changes in correlations that can result from different kinds of outliers.
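The comparison asked for in Exercises 2.64 to 2.66 can be sketched in a few lines of Python. This is only one possible approach, not the software the exercises have in mind: the helper `fit_line` and the labels in `cases` are names made up for this illustration, and the fit uses the standard sums-of-squares form of the least-squares slope.

```python
from math import sqrt

def fit_line(x, y):
    """Least-squares line of y on x: returns (intercept, slope, correlation)."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope, sxy / sqrt(sxx * syy)

# The 20 (X, Y) pairs from Exercise 2.64: left column pair first, then the right.
X = [23.07, 19.88, 18.83, 22.09, 17.19, 20.72, 18.10, 18.01, 18.69, 18.05,
     18.85, 19.96, 17.87, 20.20, 20.65, 20.32, 21.37, 17.31, 23.50, 22.02]
Y = [35.49, 30.38, 26.13, 31.85, 26.77, 29.00, 28.92, 26.30, 29.49, 31.36,
     28.17, 31.17, 27.74, 30.01, 29.61, 31.78, 32.93, 30.29, 28.57, 29.80]

# Original data, then the data with each added observation from 2.65 and 2.66.
cases = {
    "original 20 points": (X, Y),
    "plus (x=35, y=25)": (X + [35.0], Y + [25.0]),
    "plus (x=30, y=36)": (X + [30.0], Y + [36.0]),
}

results = {}
for label, (xs, ys) in cases.items():
    b0, b1, r = fit_line(xs, ys)
    results[label] = (b0, b1, r)
    print(f"{label:20s} y-hat = {b0:.2f} + {b1:.2f}x   r = {r:.3f}   r^2 = {r * r:.3f}")
```

Printing the three fits side by side makes it easier to see how much a single added point can move the slope and the correlation. A scatterplot of each case is still needed to judge what kind of outlier each point is.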
2.67 Alcohol and calories in beer. Figure 2.12 (page 90) gives a scatterplot of calories versus percent alcohol in 160 brands of domestic beer.
1. Find the equation of the least-squares regression line for these data.
2. Find the value of r² and interpret it in the regression context.
3. Write a short report on the relationship between calories and percent alcohol in beer. Include graphical and numerical summaries for each variable separately as well as graphical and numerical summaries for the relationship in your report.
2.68 Alcohol and calories in beer revisited. Refer to the previous exercise. The data that you used includes an outlier.
1. Remove the outlier and answer parts 1, 2, and 3 of the previous exercise for the new set of data.
2. Write a short paragraph about the possible effects of outliers on a least-squares regression line and the value of r², using this example to illustrate your ideas.

2.69 Always plot your data! Table 2.1 presents four sets of data prepared by the statistician Frank Anscombe to illustrate the dangers of calculating without first plotting the data.¹⁹

1. Without making scatterplots, find the correlation and the least-squares regression line for all four data sets. What do you notice? Use the regression line to predict y for x=10.
2. Make a scatterplot for each of the data sets and add the regression line to each plot.
3. In which of the four cases would you be willing to use the regression line to describe the dependence of y on x? Explain your answer in each case.

Table 2.1 Four data sets for exploring correlation and regression
Data Set A
x	10	8	13	9	11	14	6	4	12	7	5
y	8.04	6.95	7.58	8.81	8.33	9.96	7.24	4.26	10.84	4.82	5.68
Data Set B
x	10	8	13	9	11	14	6	4	12	7	5
y	9.14	8.14	8.74	8.77	9.26	8.10	6.13	3.10	9.13	7.26	4.74
Data Set C
x	10	8	13	9	11	14	6	4	12	7	5
y	7.46	6.77	12.74	7.11	7.81	8.84	6.08	5.39	8.15	6.42	5.73
Data Set D
x	8	8	8	8	8	8	8	8	8	8	19
y	6.58	5.76	7.71	8.84	8.47	7.04	5.25	5.56	7.91	6.89	12.50
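The numerical summaries for the four data sets in Table 2.1 can be computed with a short Python sketch like the one below. It is an illustration only: the function name `summarize` and the dictionary `data` are hypothetical, and any statistical software would serve equally well.

```python
from math import sqrt

def summarize(x, y):
    """Return (correlation, intercept, slope) for the least-squares line of y on x."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return sxy / sqrt(sxx * syy), y_bar - slope * x_bar, slope

# Data Sets A, B, and C share the same x values in Table 2.1.
x_abc = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
data = {
    "A": (x_abc, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "B": (x_abc, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "C": (x_abc, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "D": ([8] * 10 + [19],
          [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 5.56, 7.91, 6.89, 12.50]),
}

summaries = {}
for name, (x, y) in data.items():
    r, b0, b1 = summarize(x, y)
    summaries[name] = (r, b0, b1)
    print(f"Data Set {name}: r = {r:.3f}, y-hat = {b0:.2f} + {b1:.3f}x, "
          f"prediction at x = 10: {b0 + 10 * b1:.2f}")
```

The printed summaries cover only the first part of the exercise. They say nothing about the shape of each scatterplot, which is the point of the remaining parts.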

2.70 Progress in math scores. Every few years, the National Assessment of Educational Progress asks a national sample of eighth-graders to perform the same math tasks. The goal is to get an honest picture of progress in math. Here are a few national mean scores, on a scale of 0 to 500:²⁰

Year	1990	2000	2009	2017	2019
Score	263	273	283	283	282

1. Make a time plot of the mean scores. This is just a scatterplot of score against year. There is a slow linear increasing trend.
2. Find the regression line of mean score on time step-by-step. First calculate the mean and standard deviation of each variable and their correlation. Then find the equation of the least-squares line from these. Draw the line on your scatterplot. What percent of the year-to-year variation in scores is explained by the linear trend?
3. Now use software to verify your regression line.
4. Does the regression line give a good fit to the data? Explain your answer.
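The step-by-step calculation in this exercise can be mirrored in Python using only the standard library. The sketch below assumes the usual formulas for the least-squares line, with slope b1 = r·sy/sx and intercept b0 = ȳ − b1·x̄. The variable names are chosen here for illustration.

```python
from statistics import mean, stdev

# National mean scores from Exercise 2.70.
year = [1990, 2000, 2009, 2017, 2019]
score = [263, 273, 283, 283, 282]
n = len(year)

# Step 1: mean and standard deviation of each variable.
x_bar, y_bar = mean(year), mean(score)
s_x, s_y = stdev(year), stdev(score)   # sample standard deviations (divide by n - 1)

# Step 2: correlation as the sum of products of standardized values, divided by n - 1.
r = sum(((x - x_bar) / s_x) * ((y - y_bar) / s_y)
        for x, y in zip(year, score)) / (n - 1)

# Step 3: slope and intercept of the least-squares line from these summaries.
b1 = r * s_y / s_x
b0 = y_bar - b1 * x_bar

# Step 4: percent of the variation in score explained by the linear trend.
percent_explained = 100 * r ** 2

print(f"x-bar = {x_bar}, s_x = {s_x:.3f}; y-bar = {y_bar}, s_y = {s_y:.3f}")
print(f"r = {r:.4f}")
print(f"y-hat = {b0:.2f} + {b1:.4f} * year")
print(f"percent explained = {percent_explained:.1f}%")
```

Because the intercept is the predicted score at year 0, far outside the range of the data, it has no practical interpretation here. The slope, in points per year, is the quantity of interest.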

2.71 The regression equation. The equation of a least-squares regression line is y=25+8x.
1. What is the value of y for x=−3?
2. If x increases by one unit, what is the corresponding change in y?
3. What is the intercept for this equation?
2.72 Metabolic rate and lean body mass. Compute the mean and the standard deviation of the metabolic rates and lean body masses in Exercise 2.27 (page 91) and the correlation between these two variables. Use these values to find the slope of the regression line of metabolic rate on lean body mass. Also find the slope of the regression line of lean body mass on metabolic rate. What are the units for each of the two slopes?
2.73 Use an applet for progress in math scores. Go to the Two-Variable Statistical Calculator applet. Enter the data for the progress in math scores from Exercise 2.70. Using only the results provided by the applet, write a short report summarizing the analysis of these data.
2.74 A property of the least-squares regression line. Use the equation for the least-squares regression line to show that this line always passes through the point (x̄, ȳ).
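One way to organize the argument is sketched below. It assumes the usual form of the least-squares line, with slope b1 and intercept b0 = ȳ − b1·x̄. These symbols are assumptions here, since the exercise does not restate them.

```latex
\begin{align*}
\hat{y} &= b_0 + b_1 x, \qquad b_0 = \bar{y} - b_1 \bar{x} \\
\text{At } x = \bar{x}: \quad
\hat{y} &= (\bar{y} - b_1 \bar{x}) + b_1 \bar{x} = \bar{y}
\end{align*}
```

The slope terms cancel whatever the value of b1, so the predicted value at x̄ is always ȳ.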
2.75 Class attendance and grades. A study of class attendance and grades among first-year students at a state university showed that, in general, students who missed a higher percent of their classes earned lower grades. Class attendance explained 25% of the variation in grade index among the students. What is the numerical value of the correlation between percent of classes attended and grade index?