In this section, we expand on three topics related to simple linear regression. The first is analysis of variance for regression. If you plan to read Chapter 11 on multiple regression or Chapters 12 and 13 on comparing several means, this information is helpful preparation. The second topic concerns the computations for regression inference. Even though we recommend using software for analysis, knowing the formulas provides some additional insights. To conclude, we discuss inference for correlation and its close connection to inference about the slope.
The usual computer output for regression includes an additional block of calculations that is labeled “ANOVA” or “Analysis of Variance.” Analysis of variance, often abbreviated ANOVA, is the term for statistical analyses that break down the variation in the data into separate pieces that correspond to different sources of variation. It is closely related to the conceptual
framework we discussed earlier (page 519).
The total variation in the response y is expressed by the deviations $y_i - \bar{y}$.
As the explanatory variable x changes, the mean response changes with it along the regression line. For example, in Figure 10.3 (page 518), students averaging 10,000 steps generally have lower BMIs than those averaging 6000 steps. The fitted value $\hat{y}_i$ estimates the mean response for the ith observation, so the variation due to the model is expressed by the deviations $\hat{y}_i - \bar{y}$.

Individual observations vary Normally about their subpopulation mean. This variation is represented by the residuals $y_i - \hat{y}_i$.
We can express the deviation of any y observation from the mean of the y's as the sum of these two deviations. Specifically,

$$y_i - \bar{y} = (\hat{y}_i - \bar{y}) + (y_i - \hat{y}_i)$$

In terms of deviations, this equation expresses the idea that the total deviation is the sum of the deviation explained by the model and the residual deviation left unexplained.
Several times we have measured variation by taking an average of squared deviations. If we square each of the preceding three deviations and then sum over all n observations, it can be shown that the sums of squares add:

$$\sum (y_i - \bar{y})^2 = \sum (\hat{y}_i - \bar{y})^2 + \sum (y_i - \hat{y}_i)^2$$

We rewrite this equation as

$$\text{SST} = \text{SSM} + \text{SSE}$$

where $\text{SST} = \sum (y_i - \bar{y})^2$, $\text{SSM} = \sum (\hat{y}_i - \bar{y})^2$, and $\text{SSE} = \sum (y_i - \hat{y}_i)^2$.
The SS in each abbreviation stands for sum of squares, and the T, M, and E stand for total, model, and error, respectively. (“Error” here stands for deviations from the line, which might better be called “residual” or “unexplained variation.”) The total variation, as expressed by SST, is the sum of the variation due to the straight-line model (SSM) and the variation due to deviations from this model (SSE). This partitioning of the variation in the data between two sources is the heart of analysis of variance.
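Software makes this partition easy to verify. Here is a minimal R sketch with made-up data (the vectors x and y are illustrative, not taken from any example in this chapter) showing that SST = SSM + SSE for a least-squares fit:

```r
# Hypothetical data for illustration only
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 7.8, 9.9, 12.3)

fit  <- lm(y ~ x)    # least-squares fit of y on x
yhat <- fitted(fit)  # fitted values

SST <- sum((y - mean(y))^2)      # total variation
SSM <- sum((yhat - mean(y))^2)   # variation along the line
SSE <- sum((y - yhat)^2)         # variation about the line

c(SST = SST, SSM_plus_SSE = SSM + SSE)  # the two agree up to roundoff
```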
The variation along the line, labeled SSM, is the variation among the predicted responses $\hat{y}_i$. To see where the degrees of freedom come from, recall the sample variance of the responses,

$$s_y^2 = \frac{\sum (y_i - \bar{y})^2}{n - 1}$$

The numerator in this expression is SST. The denominator is the total degrees of freedom, or simply DFT. Thus, $\text{DFT} = n - 1$.
Just as the total sum of squares SST is the sum of SSM and SSE, we partition the total degrees of freedom DFT as the sum of DFM and DFE, the degrees of freedom for the model and for the error: $\text{DFT} = \text{DFM} + \text{DFE}$.
The model has one explanatory variable x, so the degrees of freedom for this source are $\text{DFM} = 1$. Because $\text{DFT} = n - 1$, this leaves $\text{DFE} = n - 2$ as the degrees of freedom for error.
For each source, the ratio of the sum of squares to the degrees of freedom is called the mean square, or simply MS. The general formula for a mean square is

$$\text{MS} = \frac{\text{sum of squares}}{\text{degrees of freedom}}$$
Each mean square is a type of average squared deviation. Mean square total (MST) is just $s_y^2$, the sample variance of the responses y. It is our estimate of $\sigma_y^2$, the variance of y about its overall mean.
In Section 2.4 (page 107) we noted that $r^2$ is the fraction of variation in the values of y that is explained by the least-squares regression of y on x. The sums of squares make this precise:

$$r^2 = \frac{\text{SSM}}{\text{SST}}$$

Because SST is the total variation in y, and SSM is the variation due to the regression of y on x, this equation says exactly that $r^2$ is the proportion of the total variation explained by x.
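Continuing the R sketch above, we can check this identity numerically:

```r
# r^2 equals SSM/SST (x, y, SSM, SST as in the earlier sketch)
c(r_squared = cor(x, y)^2, SSM_over_SST = SSM / SST)
```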
The null hypothesis for the ANOVA F test is $H_0\colon \beta_1 = 0$, which says that there is no straight-line relationship between x and y. The test statistic is

$$F = \frac{\text{MSM}}{\text{MSE}}$$

When $H_0$ is true, this statistic has an F(1, n − 2) distribution. When $H_0$ is false, MSM tends to be large relative to MSE, so large values of F give evidence against $H_0$ in favor of the two-sided alternative $H_a\colon \beta_1 \neq 0$.
The F distributions are a family of distributions with two parameters: the degrees of freedom of the mean square in the numerator and denominator of the F statistic. The F distributions are another of R. A. Fisher’s contributions to statistics and are called F in his honor. Fisher introduced F statistics for comparing several means. We meet these useful statistics in Chapters 12 and 13.
The numerator degrees of freedom are always mentioned first. Interchanging the degrees of freedom changes the distribution, so the order is important. Our brief notation will be F(j, k) for the F distribution with j degrees of freedom in the numerator and k in the denominator. The F distributions are not symmetric but are right-skewed. The density curve in Figure 10.14 illustrates the shape. Because mean squares cannot be negative, the F statistic takes only positive values, and the F distribution has no probability to the left of 0. The peak of the F density curve is near 1.
Figure 10.14 The density for the F(9, 10) distribution. The F distributions are skewed to the right.
Tables of F critical values are available for use when software
does not give the P-value or you cannot use a software
function, such as F.DIST.RT in Excel or pf in R. However, tables of
F critical values are awkward because
a separate table is needed for every pair of degrees of freedom.
Table E in the back of the book contains F critical values for several commonly used upper-tail probabilities.
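For instance, the R function pf mentioned above returns F tail probabilities directly, so no table is needed. A quick sketch; the error degrees of freedom below (98) are an assumed value for illustration, not taken from a specific example:

```r
# Upper-tail probability (P-value) for an observed F statistic of 17.10
pf(17.10, df1 = 1, df2 = 98, lower.tail = FALSE)

# Critical value c such that P(F > c) = 0.05 for the F(1, 98) distribution
qf(0.95, df1 = 1, df2 = 98)
```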
The F statistic tests the same null hypothesis as the t statistic for the slope, $H_0\colon \beta_1 = 0$. In fact, for simple linear regression the two tests are equivalent: $F = t^2$, and the P-values are identical.
The ANOVA calculations are displayed in an analysis of variance table, often abbreviated ANOVA table. Here is the format of the table for simple linear regression:
Source | Degrees of freedom | Sum of squares | Mean square | F |
---|---|---|---|---|
Model | DFM = 1 | $\text{SSM} = \sum (\hat{y}_i - \bar{y})^2$ | MSM = SSM/DFM | F = MSM/MSE |
Error | DFE = n − 2 | $\text{SSE} = \sum (y_i - \hat{y}_i)^2$ | MSE = SSE/DFE | |
Total | DFT = n − 1 | $\text{SST} = \sum (y_i - \bar{y})^2$ | MST = SST/DFT | |
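In R, the anova function produces this table for a fitted regression. A quick sketch, reusing the hypothetical x and y from the earlier sketch:

```r
fit <- lm(y ~ x)  # x and y as in the earlier sketch
anova(fit)        # rows "x" (the model) and "Residuals" (error):
                  # Df, Sum Sq, Mean Sq, F value, Pr(>F)

# For simple linear regression, the slope's t statistic squared equals F:
summary(fit)$coefficients["x", "t value"]^2  # same as anova(fit)["x", "F value"]
```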
The output generated by Minitab for the physical activity study in
Example 10.2
is given in
Figure 10.15. Note that Minitab uses the label “Regression” in place of
“Model.” Other software packages may use slightly different labels.
The F statistic is 17.10; the P-value is given as 0.000, which means that it is smaller than 0.0005. Minitab rounds P-values to three decimal places, so "0.000" indicates a very small P-value, not an exact zero.
Figure 10.15 Minitab output for the physical activity study, Example 10.14.
Now look at the output for the regression coefficients. The t statistic for PA satisfies $t^2 = F$ apart from roundoff, and its P-value agrees with that of the F test.
Strong evidence against the null hypothesis that there is no
relationship does not imply that a large percentage of the total
variability is explained by the model.
10.8 Reading linear regression outputs.
Figure 10.4 (page 522) shows the regression output from two software packages and Excel. Create a table that lists the labels each output uses in its ANOVA table, along with the F statistic and its P-value.
10.9 Reading the ANOVA table.
For the physical activity study, explain how the regression standard error s can be obtained from the ANOVA table in Figure 10.15.
We recommend using statistical software for regression calculations. With time and care, however, the work is feasible with a calculator or spreadsheet. We will use the following example to illustrate how to perform inference for regression analysis using a calculator.
Knowing the gestational age (GA) of a fetus is important for biochemical screening tests and planning for successful delivery. Typically, GA is calculated as the number of days since the start of the woman’s last menstrual period (LMP). However, for women with irregular periods, GA is difficult to compute, and ultrasound imaging is often used. In the search for helpful ultrasound measurements, a group of Nigerian researchers looked at the relationship between umbilical cord diameter (mm) and gestational age based on LMP (weeks).9 Here is a small subset of the data:
Umbilical cord diameter (x) | 2 | 6 | 9 | 14 | 21 | 23 |
Gestational age (y) | 16 | 18 | 26 | 33 | 28 | 39 |
The data and the least-squares regression line are plotted in Figure 10.16. The strong straight-line pattern suggests that we can use linear regression to model the relationship between diameter and gestational age.
Figure 10.16 Scatterplot and least-squares regression line, Example 10.15.
We begin our regression calculations by fitting the least-squares line. Fitting the line gives estimates $b_0$ and $b_1$ of the model parameters $\beta_0$ and $\beta_1$.
Roundoff errors that accumulate during these calculations can ruin
the final results. Be sure to carry many significant digits and
check your work carefully.
For our work here, we will carry things to the fifth decimal place.
Because the scatterplot (Figure 10.16) suggests a straight-line pattern, we begin by fitting the least-squares line.
We start by making a table with the mean and standard deviation for each of the variables, the correlation, and the sample size. These calculations should be familiar from Chapters 1 and 2. Here is the summary:
Variable | Mean | Standard deviation | Correlation | Sample size |
---|---|---|---|---|
Diameter | $\bar{x} = 12.50000$ | $s_x = 8.36062$ | $r = 0.87699$ | $n = 6$ |
Gestational age | $\bar{y} = 26.66667$ | $s_y = 8.75595$ | | |
These quantities are the building blocks for our calculations.
We will need one additional quantity for the eventual standard error calculations. It is the expression $\sum (x_i - \bar{x})^2 = 349.50000$, the sum of squared deviations of the x's about their mean.
Using the summary statistics provided in Example 10.16 and the formulas on pages 520–521, the slope of the least-squares line is

$$b_1 = r\,\frac{s_y}{s_x} = 0.87699 \times \frac{8.75595}{8.36062} = 0.91845$$

The intercept is

$$b_0 = \bar{y} - b_1 \bar{x} = 26.66667 - (0.91845)(12.50000) = 15.18604$$

The equation of the least-squares regression line is therefore

$$\hat{y} = 15.18604 + 0.91845x$$

This is the line shown in Figure 10.16.
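As a check on the hand work, here is a minimal R sketch that reproduces the summary statistics and least-squares estimates from the data in Example 10.15 (the variable names are ours):

```r
# Umbilical cord diameter (mm) and gestational age (weeks)
diameter <- c(2, 6, 9, 14, 21, 23)
gestage  <- c(16, 18, 26, 33, 28, 39)

n    <- length(diameter)                    # 6
xbar <- mean(diameter); sx <- sd(diameter)  # 12.5 and 8.36062
ybar <- mean(gestage);  sy <- sd(gestage)   # 26.66667 and 8.75595
r    <- cor(diameter, gestage)              # 0.87699
Sxx  <- sum((diameter - xbar)^2)            # 349.5, needed for standard errors

b1 <- r * sy / sx        # slope, about 0.91845
b0 <- ybar - b1 * xbar   # intercept, about 15.18604
```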
Now that we have estimates of the first two parameters of our model, $\beta_0$ and $\beta_1$, we estimate the remaining parameter, the standard deviation $\sigma$ about the line, using the residuals. The first observation is a diameter of $x_1 = 2$ mm with gestational age $y_1 = 16$ weeks. The fitted value is

$$\hat{y}_1 = 15.18604 + (0.91845)(2) = 17.02294$$

and the residual is

$$e_1 = y_1 - \hat{y}_1 = 16 - 17.02294 = -1.02294$$

The residuals for the other diameters are calculated in the same way. They are −2.69674, 2.54791, 4.95566, −6.47349, and 2.68961. As a check on our work, the six residuals sum to 0.00001, which is zero up to roundoff. The estimate of $\sigma^2$ is the sum of the squared residuals divided by $n - 2$:

$$s^2 = \frac{\sum e_i^2}{n - 2} = \frac{88.50930}{4} = 22.12733$$

So the estimate of the standard deviation about the line is

$$s = \sqrt{22.12733} = 4.70397$$
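Continuing the R sketch, the residuals and s take only a few lines:

```r
yhat <- b0 + b1 * diameter        # fitted values
res  <- gestage - yhat            # residuals
sum(res)                          # about 0 -- the check used in Example 10.18
s <- sqrt(sum(res^2) / (n - 2))   # regression standard error, about 4.70397
```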
10.10 Computing the residuals. In
Example 10.18,
we computed the residual for the first observation and
reported the residuals for the other five observations. Run
through the calculations for the second and third observations
to verify these values.
10.11 More on the model fitting check.
In
Example 10.18,
we verified our calculations by checking that the residuals sum to zero. If they do not sum to zero, we either made an error in calculating the residuals or in calculating the estimates $b_0$ and $b_1$. Calculate the residuals under a deliberately altered fit, with the slope or intercept changed slightly from the least-squares values, and verify that these residuals do not sum to zero.
Confidence intervals and significance tests for the slope

Inference about the slope and intercept is based on the estimates $b_1$ and $b_0$ and their standard deviations. The standard deviation of $b_1$ is

$$\sigma_{b_1} = \frac{\sigma}{\sqrt{\sum (x_i - \bar{x})^2}}$$

Similarly, the standard deviation of $b_0$ is

$$\sigma_{b_0} = \sigma \sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$$

To estimate these standard deviations, we need only replace $\sigma$ by its estimate s.
The plot of the regression line with the data in Figure 10.16 shows a very strong relationship, but our sample size is small. We assess the situation with a significance test for the slope.
First we find the standard error of the estimated slope:

$$SE_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} = \frac{4.70397}{\sqrt{349.50000}} = 0.25162$$

To test

$$H_0\colon \beta_1 = 0 \quad \text{versus} \quad H_a\colon \beta_1 \neq 0$$

we calculate the t statistic:

$$t = \frac{b_1}{SE_{b_1}} = \frac{0.91845}{0.25162} = 3.65$$

Using Table D with $n - 2 = 4$ degrees of freedom, we find that 3.65 falls between the critical values for upper-tail probabilities 0.02 and 0.01, so $0.02 < P < 0.04$ for the two-sided test. Software gives $P = 0.022$.
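Continuing the R sketch, the same test takes three lines; pt supplies the exact P-value that Table D only brackets:

```r
SEb1  <- s / sqrt(Sxx)   # standard error of the slope, about 0.25162
tstat <- b1 / SEb1       # t statistic, about 3.65
2 * pt(abs(tstat), df = n - 2, lower.tail = FALSE)  # two-sided P-value, about 0.022
```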
Two things are important to note about this example. First, with a sample size this small, only a very large effect can be detected; we should not expect to declare a slope different from zero unless the relationship is strong.
The estimated slope is more than 3.5 standard deviations away from
zero, but we are not much below the 0.05 standard for statistical
significance. Second, because we expect gestational age to increase
with increasing diameter, a one-sided significance test would be
justified in this setting.
The significance test tells us that the data provide sufficient
information to conclude that gestational age and umbilical cord
diameter are linearly related. We use the estimate $b_1$ and its standard error next to construct a confidence interval for the population slope $\beta_1$.
Let’s find a 95% confidence interval for the slope
The interval is (0.220, 1.617). For each additional millimeter in diameter, the gestational age of the fetus is expected to be 0.220 to 1.617 weeks older.
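The same interval in R, continuing the sketch:

```r
tstar <- qt(0.975, df = n - 2)  # 2.776 for 4 degrees of freedom
b1 + c(-1, 1) * tstar * SEb1    # 95% confidence interval, (0.220, 1.617)
```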
In this example, the intercept $\beta_0$ is not of practical interest: a cord diameter of $x = 0$ falls well outside the range of the data, so the intercept simply positions the line and does not describe any real fetus.
When we substitute a particular value $x^*$ of the explanatory variable into the regression equation and calculate the resulting value of $\hat{y}$, we can view the result in two ways:

We have estimated the mean response when $x = x^*$.

We have predicted a future value of the response y when $x = x^*$.
The margins of error for these two uses are often quite different. Recall that prediction intervals for an individual response are wider than confidence intervals for estimating a mean response. We now proceed with the details of these calculations. Once again, standard errors are the essential quantities. And once again, these standard errors are multiples of s, our basic measure of the variability of the responses about the fitted line.
Note that the only difference between the formulas for these two standard errors is the extra 1 under the square root sign in the standard error for prediction. This standard error is larger due to the additional variation of individual responses about the mean response. This additional variation remains regardless of the sample size n and is the reason that prediction intervals are wider than the confidence intervals for the mean response.
For the gestational age example, we can think about the average gestational age for a particular subpopulation, defined by the umbilical cord diameter. The confidence interval would provide an interval estimate of this subpopulation mean. On the other hand, we might want to predict the gestational age for a new fetus. A prediction interval attempts to capture this new observation.
Let’s find a 95% confidence interval for the average gestational age when the umbilical cord diameter is 10 millimeters. The estimated mean age is
The standard error is
To find the 95% confidence interval we compute
The interval is 18.8 to 30.0 weeks of age. This is a pretty wide interval, given gestation is about 40 weeks.
Calculations for the prediction intervals are similar. The only difference is the use of the formula for $SE_{\hat{y}}$, the standard error for prediction, in place of $SE_{\hat{\mu}}$.
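In R, predict computes both intervals from a fitted model. A sketch, continuing with the umbilical cord data:

```r
fit   <- lm(gestage ~ diameter)     # diameter and gestage as before
new10 <- data.frame(diameter = 10)  # x* = 10 mm

predict(fit, new10, interval = "confidence")  # mean response: about 18.8 to 30.0 weeks
predict(fit, new10, interval = "prediction")  # new observation: wider, about 10.2 to 38.6
```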
The correlation coefficient is a measure of the strength and direction of the linear association between two variables. Correlation does not require an explanatory–response relationship between the variables. We can consider the sample correlation r as an estimate of the correlation in the population and base inference about the population correlation on r.
The correlation between the variables x and y when they
are measured for every member of a population is the
population
correlation. As usual, we use Greek letters to represent population parameters.
In this case, we denote the population correlation by the Greek letter ρ (rho). When ρ = 0, there is no linear association between x and y in the population.
Most computer packages have routines for calculating correlations, and
some will provide the significance test of the null hypothesis $H_0\colon \rho = 0$, that the population correlation is zero.
The R output for the physical activity example (page 518) appears in
Figure 10.17. The sample correlation between BMI and the average number of
steps per day (PA) is negative, as we would expect from the scatterplot in Figure 10.3.
Figure 10.17 R output for the physical activity study, Example 10.23.
To test the one-sided alternative that the population correlation is negative, we divide the P-value in the output by 2, after checking that the sample coefficient is in fact negative. If your software does not give the significance test, you can do the computations easily with a calculator.
The correlation between BMI and PA is the value r shown in the output of Figure 10.17. The degrees of freedom are $n - 2$, and the test statistic is

$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$$

which has the t(n − 2) distribution when $H_0\colon \rho = 0$ is true.
There is a close connection between the significance test for a correlation and the test for the slope in a linear regression. Recall that

$$b_1 = r\,\frac{s_y}{s_x}$$

From this fact we see that if the slope is 0, so is the correlation, and vice versa. It should come as no surprise that the procedures for testing $H_0\colon \beta_1 = 0$ and $H_0\colon \rho = 0$ are equivalent: the two t statistics are numerically identical and give the same P-value.
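The raw BMI data are not reproduced here, but the equivalence is easy to check numerically with the umbilical cord data from the sketch above:

```r
# t statistic for testing H0: rho = 0 (r and n as in the earlier sketch)
t_r <- r * sqrt(n - 2) / sqrt(1 - r^2)            # 3.65, identical to the slope t
2 * pt(abs(t_r), df = n - 2, lower.tail = FALSE)  # same P-value, about 0.022

cor.test(diameter, gestage)  # reports the same t, df, and P, plus a 95% CI for rho
```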
In our example the conclusion that there is a statistically
significant correlation between the two variables would not come as a
surprise to anyone familiar with the meaning of these variables. The
significance test simply tells us whether there is evidence in the
data to conclude that the population correlation is different from 0.
The actual size of the correlation is of considerably more interest.
We would therefore like to give a confidence interval for the
population correlation. In
Figure 10.17, R reports a
95% confidence interval for the population correlation ρ.
10.12 Testing the correlation. The gestational
age study of
Example 10.16
reports a correlation of r = 0.87699 between umbilical cord diameter and gestational age, based on n = 6 observations.
Test whether the population correlation is significantly different from zero, using the 0.05 significance level.
Compare your t statistic for the correlation with the t statistic for the slope computed earlier for these data.
Analysis of variance (ANOVA) for simple linear regression partitions the total variation in the responses between two sources: the linear relationship of y with x and the residual variation in responses for the same x.
An ANOVA table for a linear regression organizes these
ANOVA calculations into degrees of freedom,
sums of squares, and mean squares for the model,
error, and total sources of variation. The ANOVA
F statistic is the ratio MSM/MSE. Under $H_0\colon \beta_1 = 0$, this statistic has an F(1, n − 2) distribution, and large values of F give evidence against $H_0$.
The square of the sample correlation can be expressed as $r^2 = \dfrac{\text{SSM}}{\text{SST}}$
and is interpreted as the proportion of the variability in the response variable y that is explained by the explanatory variable x in the linear regression.
The standard errors for $b_1$ and $b_0$ are

$$SE_{b_1} = \frac{s}{\sqrt{\sum (x_i - \bar{x})^2}} \qquad SE_{b_0} = s\sqrt{\frac{1}{n} + \frac{\bar{x}^2}{\sum (x_i - \bar{x})^2}}$$

The standard error of $\hat{\mu}$, the estimated mean response when $x = x^*$, is

$$SE_{\hat{\mu}} = s\sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

This standard error is used in a confidence interval for the mean response. The standard error for predicting an individual response when $x = x^*$ is

$$SE_{\hat{y}} = s\sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

This standard error is used in a prediction interval.
When the variables y and x are jointly Normal, the sample correlation r is an estimate of the population correlation ρ. The test of $H_0\colon \rho = 0$ is based on the statistic

$$t = \frac{r\sqrt{n-2}}{\sqrt{1 - r^2}}$$

which has a t(n − 2) distribution under $H_0$. This t statistic is numerically identical to the t statistic for testing $H_0\colon \beta_1 = 0$.
10.19 What’s wrong? For each of the following, explain what is wrong and why.
In simple linear regression, the null hypothesis of the ANOVA F test is $H_0\colon b_1 = 0$.
In an ANOVA table, the mean squares add. In other words, MST = MSM + MSE.
The smaller the P-value for the ANOVA F test, the greater the explanatory power of the model.
The total degrees of freedom in an ANOVA table are equal to the number of observations n.
10.20 What’s wrong? For each of the following, explain what is wrong and why.
In simple linear regression, the standard error for a future observation is s, the measure of spread about the regression line.
In an ANOVA table, SSE is the sum of the deviations.
There is a close connection between the correlation r and the intercept of the regression line.
The squared correlation is given by $r^2 = \text{SSE}/\text{SST}$.
10.21 Research and development spending. The National Science Foundation collects data on research and development spending by universities and colleges in the United States.11 Here are the data for spending in the years 2013–2016 that was nonfederally funded:
Year | 2013 | 2014 | 2015 | 2016 |
Spending (billions of dollars) | 27.6 | 29.2 | 30.7 | 33.0 |
Do the following by hand or with a calculator and verify your
results with a software package or Excel.
Create a scatterplot that shows the increase in research and development spending over time. Does the pattern suggest that the spending is increasing linearly over time? Explain your answer.
Find the equation of the least-squares regression line for predicting spending over time. Add this line to your scatterplot.
For each of the four years, find the residual. Use these residuals to calculate the regression standard error s.
Write the regression model for this setting. What are your estimates of the unknown parameters in this model?
Use your least-squares results to construct a 95% prediction interval for research and development spending for 2017. The actual spending for that year was $34.9 billion. Comment on how well the model predicted the actual outcome.
Explain why a prediction interval rather than a confidence interval is appropriate for part (e).
(Comment: These are time series data. Simple regression is often a good fit to time series data over a limited span of time.)
10.22 Food neophobia. Food neophobia is a personality trait associated with avoiding unfamiliar foods. In one study of 564 children who were two to six years of age, the degree of food neophobia and the frequency of consumption of different types of food were measured.12 Here is a summary of the correlations:
Type of food | Correlation |
---|---|
Vegetables | |
Fruit | |
Meat | |
Eggs | |
Sweet/fatty snacks | 0.04 |
Starchy staples | |
Perform the significance test for each correlation and write a summary about food neophobia and the consumption of different types of food.
10.23 Correlation between the prevalences of adult binge drinking and underage drinking. A group of researchers compiled data on the prevalence of adult binge drinking and the prevalence of underage drinking in 42 states.13 A correlation of 0.32 was reported.
Test the null hypothesis that the population correlation is zero.
Explain this correlation in terms of the direction of the association and the percent of variability in the prevalence of underage drinking that is explained by the prevalence of adult binge drinking.
The researchers collected information from 42 of 50 states, so almost all the data available were used in the analysis. Provide an argument for the use of statistical inference in this setting.
10.24 Grade inflation. The average undergraduate GPA for American colleges and universities was estimated based on a sample of institutions that published this information.14 Here are the data for public schools in that report:
Year | 1992 | 1996 | 2002 | 2007 |
GPA | 2.85 | 2.90 | 2.97 | 3.01 |
Do the following by hand or with a calculator and verify your
results with a software package.
Make a scatterplot that shows the increase in GPA over time. Does a linear increase appear reasonable?
Find the equation of the least-squares regression line for predicting GPA from year. Add this line to your scatterplot.
Compute a 95% confidence interval for the slope and summarize what this interval tells you about the increase in GPA over time.
10.25 Completing an ANOVA table. How are returns on common stocks in overseas markets related to returns in U.S. markets? Consider measuring U.S. returns by the annual rate of return on the Standard & Poor’s 500 stock index and overseas returns by the annual rate of return on the Morgan Stanley Europe, Australasia, Far East (EAFE) index.15 Both are recorded in percents. We will regress the EAFE returns on the S&P 500 returns for the 31 years 1989 to 2019. Here is part of the Minitab output for this regression:
The regression equation is
EAFE = −3.50 + 0.831 S&P
Analysis of Variance
Source | DF | SS | MS | F |
---|---|---|---|---|
Regression | 1 | 6427.4 | | |
Residual Error | | | | |
Total | | 11108.5 | | |
Using the ANOVA table format on page 543 as a guide, complete the analysis of variance table.
10.26 Interpreting statistical software output.
Refer to the previous exercise. What are the values of the
estimated model standard error s and the squared correlation $r^2$ for this regression?
10.27 Confidence intervals for the slope and intercept.
Refer to the previous two exercises. The mean and standard
deviation of the S&P 500 returns for these years are 12.11%
and 17.61%, respectively. From this and your work in the
previous two exercises:
Find the standard error for the least-squares slope
Give a 95% confidence interval for the slope
Explain why the intercept $\beta_0$ is a meaningful quantity in this example.
Find the standard error for the least-squares intercept $b_0$.