In this section, we illustrate multiple regression by analyzing the data
from the study described in
Example 11.1.
There are data for 150 students.
Before starting the analysis, we must first consider the extent to which our results can be generalized. For this study, all the available data are being analyzed; there is no random sampling from the population of science majors. In this type of setting, we often justify the use of inference by viewing the data as coming from some sort of process. Here, we consider this collection of students as a sample of all the science majors who will attend this university. Still, opinions may vary as to the extent to which these data can be considered an SRS of future students. For example, schools seem to brag consistently that their new batch of first-year students is the smartest and most accomplished group they’ve ever had.
As with any other statistical analysis, we begin our multiple regression with a careful examination of the data. We first look at each variable separately and then at relationships among the variables. In both cases, we continue our practice of combining plots and numerical descriptions. We use a mix of JMP, Excel, Minitab, and R to illustrate the outputs that are given by most software.
Means, standard deviations, and minimum and maximum values appear in
Figure 11.2. The minimum value for high school mathematics (HSM) appears to be
rather extreme; it is 4.51 standard deviations below the mean.
Figure 11.2 Descriptive statistics for the academic success case study, Example 11.3.
The mean for the SATM score is higher than the means for the Critical Reading (SATCR) and Writing (SATW) scores, as we might expect for a group of science majors. The three SAT standard deviations are all about the same.
Although mathematics scores were higher on the SAT, the means and standard deviations of the three high school grade variables are very similar. Because the level and difficulty of high school courses vary within and across schools, this may not be that surprising. The mean GPA is 2.842 on a four-point scale, with standard deviation 0.818.
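As a concrete illustration, here is a minimal Python sketch of how such descriptive summaries could be computed. The file name gpa.csv is a hypothetical placeholder; the column names match the variables used in the text.

```python
import pandas as pd

# Hypothetical file name; one row per student, with the variables named
# as in the text: GPA, SATM, SATCR, SATW, HSM, HSS, HSE.
gpa = pd.read_csv("gpa.csv")
variables = ["GPA", "SATM", "SATCR", "SATW", "HSM", "HSS", "HSE"]

# Mean, standard deviation, minimum, and maximum, as in Figure 11.2.
print(gpa[variables].agg(["mean", "std", "min", "max"]).round(3))

# How many standard deviations from the mean is the HSM minimum?
z_min = (gpa["HSM"].min() - gpa["HSM"].mean()) / gpa["HSM"].std()
print(f"HSM minimum is {z_min:.2f} standard deviations from the mean")
```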
It is often easier to study a variable visually than to look at a set of numerical summaries. For quantitative variables, we can use boxplots and histograms to examine the shapes of their distributions and identify extreme values. For categorical variables, we can use bar plots or pie charts.
Because the variables GPA, SATM, SATCR, and SATW have many possible values, we could use boxplots or histograms to examine their shapes. For GPA (not shown), the distribution is strongly skewed to the left; in light of this skewness, its minimum value no longer looks extreme.
The high school grade variables HSM, HSS, and HSE take only integer
values. The bar plots using relative frequencies are shown in
Figure 11.3. The distributions are all skewed, with a large proportion of high grades.
Figure 11.3 Bar plots using relative frequencies, Example 11.4.
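A sketch of how these relative-frequency bar plots could be drawn in Python follows; as before, gpa.csv is a hypothetical file name.

```python
import matplotlib.pyplot as plt
import pandas as pd

gpa = pd.read_csv("gpa.csv")  # hypothetical file name

fig, axes = plt.subplots(1, 3, figsize=(10, 3), sharey=True)
for ax, var in zip(axes, ["HSM", "HSS", "HSE"]):
    # value_counts(normalize=True) converts counts to relative frequencies.
    freq = gpa[var].value_counts(normalize=True).sort_index()
    ax.bar(freq.index, freq.values)
    ax.set_xlabel(var)
axes[0].set_ylabel("Relative frequency")
plt.tight_layout()
plt.show()
```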
The purpose of examining these numerical and visual summaries is to
understand the features of each variable before attempting to use it
in a complicated model.
Extreme values of any variable should be noted and checked for
accuracy. If they are found to be correct, the cases with these values should
be carefully examined to see if they are truly exceptional and perhaps
do not belong in the same analysis with the other cases. We will see
that when our data are examined in this way, no obvious problems are
evident.
It should also be noted that this preliminary analysis did not involve
checks of Normality using Normal quantile plots. Just as with simple
linear regression,
the multiple regression model does not require any of these
observed distributions to be Normal. Only the deviations of the responses y from their means are
assumed to be Normal. This condition can be assessed only after we’ve
fit the model.
11.3 Examining the distributions of the other variables.
Use boxplots or histograms to examine the distributions of GPA,
SATM, SATCR, and SATW. Describe the shapes and comment on any
extreme values.
The second step in our analysis is to examine the relationships between all pairs of variables. Scatterplots and correlations are our tools for studying two-variable relationships.
The correlations appear in
Figure 11.4. As we might expect, high school math and science grades have the highest correlations with GPA.
Some software output includes the P-value for the test of the null hypothesis that the population correlation is 0 versus the two-sided alternative for each pair. When you have a large sample size, this information is not that helpful, because even fairly weak associations are found to be statistically significant. For example, with n = 150 students, a correlation of about 0.16 is significantly different from 0 at the 0.05 level.
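The correlation matrix and the accompanying tests can be reproduced with a short Python sketch (hypothetical file name again); scipy's pearsonr returns both the correlation and its two-sided P-value.

```python
import pandas as pd
from scipy import stats

gpa = pd.read_csv("gpa.csv")  # hypothetical file name
variables = ["GPA", "SATM", "SATCR", "SATW", "HSM", "HSS", "HSE"]

# Correlation matrix, as in Figure 11.4.
print(gpa[variables].corr().round(2))

# Test H0: population correlation = 0 for one pair of variables.
r, p = stats.pearsonr(gpa["GPA"], gpa["HSM"])
print(f"r = {r:.2f}, two-sided P = {p:.4f}")
```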
Figure 11.4 Correlations among case study variables, Example 11.5.
It is important to keep in mind that, by examining pairs of variables,
we are seeking a better understanding of the data.
The fact that the correlation of a particular explanatory variable
with the response variable does not achieve statistical significance
does not necessarily imply that it will not be a useful (and
statistically significant) predictor in a multiple regression
model.
Numerical summaries such as correlations are useful, but plots are generally more informative when seeking to understand data. Plots tell us whether the numerical summary gives a fair representation of the data. For a multiple regression, each pair of variables should be plotted. For the seven variables in our case study, this means that we should examine 21 plots. In general, there are k(k − 1)/2 pairs of variables, and thus plots to examine, when there are k variables.
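In Python, all of these pairwise plots can be produced at once with a scatterplot matrix; this sketch again assumes the hypothetical gpa.csv file.

```python
import matplotlib.pyplot as plt
import pandas as pd

gpa = pd.read_csv("gpa.csv")  # hypothetical file name
variables = ["GPA", "SATM", "SATCR", "SATW", "HSM", "HSS", "HSE"]

# One panel for each of the k(k - 1)/2 = 21 pairs of variables.
pd.plotting.scatter_matrix(gpa[variables], figsize=(10, 10), diagonal="hist")
plt.show()
```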
Multiple regression is a complicated procedure. If we do not do the
necessary preliminary work, we are in serious danger of producing
useless or misleading results.
11.4 Pairwise relationships among variables in the GPA data
set.
Most statistical software packages have the option to create a “scatterplot matrix” of all pairs of variables in a single display. Generate this matrix for the GPA data set and describe the pairwise relationships; Figure 11.5 shows one version that includes the least-squares lines.
Figure 11.5 Scatterplot matrix for the academic success case study, including the least-squares lines, Check-in question 11.4.
To explore the relationship between the explanatory variables and our response variable GPA, we run several multiple regressions. The explanatory variables fall into three classes. High school grades are represented by HSM, HSS, and HSE; standardized tests are represented by the three SAT scores; and sex of the student is represented by Sex. We begin our analysis by using the high school grades to predict GPA.
Figure 11.6
gives the multiple regression output when using the three high
school grades as explanatory variables. The output contains an ANOVA
table, some additional fit statistics, and information about the
parameter estimates. Because there are n = 150 cases and p = 3 explanatory variables, the degrees of freedom are p = 3 for the model, n − p − 1 = 146 for error, and n − 1 = 149 in total.
The ANOVA F statistic is 14.35, with a P-value of less than 0.0001. Under the null hypothesis that all three regression coefficients are 0, the F statistic has an F(3, 146) distribution. According to this distribution, the chance of obtaining an F statistic of 14.35 or larger is less than 0.0001. Therefore, we conclude that at least one of the three regression coefficients for the high school grades is different from 0 in the population regression equation.
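The following sketch shows how this regression and its ANOVA F test could be run in Python with statsmodels. The file name is hypothetical; with the actual data, the output should reproduce the F(3, 146) statistic of 14.35 shown in Figure 11.6.

```python
import pandas as pd
import statsmodels.formula.api as smf

gpa = pd.read_csv("gpa.csv")  # hypothetical file name

# Regress GPA on the three high school grade variables.
fit_hs = smf.ols("GPA ~ HSM + HSS + HSE", data=gpa).fit()

# Overall ANOVA F test of H0: all three slopes are 0.
print(f"F({fit_hs.df_model:.0f}, {fit_hs.df_resid:.0f}) = {fit_hs.fvalue:.2f}, "
      f"P = {fit_hs.f_pvalue:.4g}")
print(fit_hs.summary())
```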
Figure 11.6 Multiple regression output for regression using high school grades to predict GPA, Example 11.6.
In the fit statistics that precede the ANOVA table, we find that Root MSE is 0.726. This value is the square root of the MSE given in the ANOVA table and is s, the estimate of the model standard deviation σ.
Although the P-value of the F test is very small, the model does not explain very much of the variation in GPA. Remember, a small P-value does not necessarily tell us that we have a strong predictive relationship, particularly when the sample size is large.
From the Parameter Estimates section of the output in Figure 11.6, we obtain the fitted regression equation. We can use this equation to predict the grade point average of any incoming student by substituting the student’s three high school grades.
The ANOVA F test assesses the predictive ability of the set of explanatory variables. It does not tell us if each variable is helpful, given the other variables in the model. That is assessed using the individual parameter t tests, which are also supplied in the Parameter Estimates section of Figure 11.6. Recall that the t statistics for testing the regression coefficients are obtained by dividing the estimates by their standard errors. Thus, for the coefficient of HSM, we obtain the t-value given in the output by calculating the ratio of its estimate to its standard error.
The P-values appear in the last column. Note that these P-values are for the two-sided alternatives. HSM has a P-value of 0.0262, and we conclude that the regression coefficient for this explanatory variable is significantly different from 0. The P-values for the other explanatory variables (0.0536 for HSS and 0.3728 for HSE) do not achieve statistical significance.
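A short sketch of this calculation, continuing with the same hypothetical data file: each t statistic is the estimate divided by its standard error, and statsmodels reports the matching two-sided P-values.

```python
import pandas as pd
import statsmodels.formula.api as smf

gpa = pd.read_csv("gpa.csv")  # hypothetical file name
fit_hs = smf.ols("GPA ~ HSM + HSS + HSE", data=gpa).fit()

# t = b / SE(b); pvalues are for the two-sided alternatives.
table = pd.DataFrame({"b": fit_hs.params, "SE(b)": fit_hs.bse,
                      "t": fit_hs.params / fit_hs.bse, "P": fit_hs.pvalues})
print(table.round(4))
```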
The significance tests for the individual regression coefficients seem to contradict the impression obtained by examining the correlations in Figure 11.4. In that display, we see that the correlation between GPA and HSS is 0.44, and the correlation between GPA and HSE is 0.36. Both of these correlations are statistically significant, meaning that if we used HSS alone in a regression to predict GPA, or if we used HSE alone, we would obtain statistically significant regression coefficients.
This phenomenon is not unusual in multiple regression analysis. Part of the explanation lies in the correlations between HSM and the other two explanatory variables. These are rather high (at least compared with most other correlations in Figure 11.4). The correlation between HSM and HSS is 0.67, and that between HSM and HSE is 0.49. Thus, when we have a regression model that contains all three high school grades as explanatory variables, there is considerable overlap of the predictive information contained in these variables. This is called collinearity or multicollinearity. In extreme cases, collinearity can cause numerical instabilities that result in very imprecise parameter estimates.
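One common way to quantify this overlap is with variance inflation factors (VIFs), a diagnostic not discussed in the text but easy to compute; here is a sketch under the same file-name assumption.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

gpa = pd.read_csv("gpa.csv")  # hypothetical file name

# VIFs for the three high school grade variables; values well above 1
# indicate that a predictor is partly explained by the other predictors.
X = sm.add_constant(gpa[["HSM", "HSS", "HSE"]])
for i, name in enumerate(X.columns):
    if name != "const":
        print(f"VIF({name}) = {variance_inflation_factor(X.values, i):.2f}")
```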
As mentioned earlier,
the significance tests for individual regression coefficients
assess the significance of each predictor variable, assuming that
all other predictors are included in the regression equation. Given that we use a model with HSM and HSS as predictors, the
coefficient of HSE is not statistically significant. Similarly, given
that we have HSM and HSE in the model, HSS does not have a significant
regression coefficient. HSM, however, adds significantly to our
ability to predict GPA even after HSS and HSE are already in the
model.
Unfortunately, we cannot conclude from this analysis that the pair of explanatory variables HSS and HSE contribute nothing significant to our model for predicting GPA once HSM is in the model. Questions like these require fitting additional models.
The impact of relationships among the several explanatory variables on fitting models for the response is the most important new phenomenon encountered in moving from simple linear regression to multiple regression. In this chapter, we can only illustrate some of the many complicated problems that can arise.
As in simple linear regression, we should always examine the residuals as an aid to determining whether the multiple regression model is appropriate for the data. Because there are several explanatory variables, we must examine several residual plots. It is usual to plot the residuals versus the predicted values and also versus each of the explanatory variables. If the deviations in the model are Normal, the residuals should also look approximately Normal. Figure 11.7 displays a Normal quantile plot and a histogram of the residuals from the high school grades model; both show that the distribution of the residuals is skewed to the left.
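These residual plots can be generated with a few lines of Python; the sketch below assumes the same hypothetical file and produces a Normal quantile plot in the style of Figure 11.7(a) plus a plot of residuals versus predicted values.

```python
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

gpa = pd.read_csv("gpa.csv")  # hypothetical file name
fit_hs = smf.ols("GPA ~ HSM + HSS + HSE", data=gpa).fit()

fig, axes = plt.subplots(1, 2, figsize=(9, 4))
# Residuals versus predicted values: look for random scatter about 0.
axes[0].scatter(fit_hs.fittedvalues, fit_hs.resid)
axes[0].axhline(0, color="gray")
axes[0].set_xlabel("Predicted GPA")
axes[0].set_ylabel("Residual")
# Normal quantile plot of the residuals, as in Figure 11.7(a).
sm.qqplot(fit_hs.resid, line="q", ax=axes[1])
plt.tight_layout()
plt.show()
```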
Figure 11.7 (a) Normal quantile plot and (b) histogram of the residuals from the high school grades model, Example 11.8.
A GPA cannot exceed 4.0, and when many observations are near this upper limit, inference can suffer from a ceiling effect. When the ceiling effect is too strong, least-squares regression can struggle to identify useful predictors and may lead to inaccurate predictions. Given the large sample size for this example, we do not think this skewness is strong enough to invalidate our inference on coefficients. This skewness, however, would be an issue if we considered constructing prediction intervals. More advanced models are available for analysis in the presence of a ceiling effect; consult an expert if you are concerned that the effect may be too strong.
11.5 Residual plots for the GPA analysis.
Using a statistical package, fit the linear model with HSM, HSS,
and HSE as predictors and obtain the residuals and predicted
values. Plot the residuals versus the predicted values, HSM,
HSS, and HSE. Are the residuals more or less randomly dispersed
around zero? Comment on any unusual patterns.
Because the variable HSE has the largest P-value of the three explanatory variables (see Figure 11.6) and, therefore, appears to contribute the least to our explanation of GPA, we rerun the regression using only HSM and HSS as explanatory variables.
Minitab output appears in Figure 11.8. The F statistic indicates that we can reject the null hypothesis that the regression coefficients for the two explanatory variables are both 0. The P-value is still less than 0.0001.
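A sketch of the refit in Python, again under the hypothetical file-name assumption:

```python
import pandas as pd
import statsmodels.formula.api as smf

gpa = pd.read_csv("gpa.csv")  # hypothetical file name

# Drop HSE and refit using HSM and HSS only.
fit_two = smf.ols("GPA ~ HSM + HSS", data=gpa).fit()
print(f"F = {fit_two.fvalue:.2f}, P = {fit_two.f_pvalue:.4g}, "
      f"s = {fit_two.mse_resid ** 0.5:.3f}")   # s = sqrt(MSE)
print(fit_two.params.round(4))                 # new fitted coefficients
```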
Figure 11.8 Multiple regression output for model using HSM and HSS to predict GPA, Example 11.9.
The estimated model standard deviation s is nearly identical for the two regressions, which is another indication that we lose very little when we drop HSE. The t statistics for the individual regression coefficients indicate that HSM remains statistically significant in the smaller model.
Comparison of the fitted equations for the two multiple regression analyses tells us something more about the intricacies of this procedure. Compare the estimated coefficients from the first run (Figure 11.6) with those from the second (Figure 11.8).
Eliminating HSE from the model changes the regression coefficients for
all the remaining variables and the intercept. This phenomenon occurs
quite generally in multiple regression.
Individual regression coefficients, their standard errors, and
significance tests are meaningful only when interpreted in the
context of the other explanatory variables in the model.
What should not change much when the two models fit about equally well are the predicted values: the difference in the estimates given by the two equations is small relative to the regression standard error.
Let’s now turn to the problem of predicting GPA using just the three SAT scores.
Figure 11.9 gives the Excel output for the multiple regression using the three SAT scores, including the fitted model.
The degrees of freedom are as expected: 3, 146, and 149. The F statistic is 6.28, with a P-value of 0.0005. We conclude that the regression coefficients for SATM, SATCR, and SATW are not all 0. Recall that we obtain this P-value as the probability that an F(3, 146) random variable takes a value of 6.28 or larger.
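The SAT-only model could be fit the same way; with the actual data, this sketch should reproduce the F statistic of 6.28 reported above (file name hypothetical).

```python
import pandas as pd
import statsmodels.formula.api as smf

gpa = pd.read_csv("gpa.csv")  # hypothetical file name

fit_sat = smf.ols("GPA ~ SATM + SATCR + SATW", data=gpa).fit()
print(f"F({fit_sat.df_model:.0f}, {fit_sat.df_resid:.0f}) = {fit_sat.fvalue:.2f}, "
      f"P = {fit_sat.f_pvalue:.4f}")
print(fit_sat.summary())
```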
Stating that we have a statistically significant result is quite
different from saying that an effect is large or important.
Figure 11.9 Multiple regression output for model using SAT scores to predict GPA, Example 11.10.
Further examination of the output in Figure 11.9 reveals that the coefficient of SATM is significant, whereas the coefficients of SATCR and SATW are not.
We have seen that fitting a model using either the high school grades
or the SAT scores results in a highly significant regression equation.
The mathematics component of each of these groups of explanatory
variables appears to be a key predictor. A comparison of the values of R² for the two fits raises a natural question: Would a model that combines the high school grades and the SAT scores predict GPA better than either set alone? To address this question, we run the regression with all six
explanatory variables. The output from JMP appears in
Figure 11.10. The degrees of freedom are as expected: 6, 143, and 149. The F statistic is 8.95, with a P-value of less than 0.0001.
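A sketch for the six-variable model follows; the rsquared and pvalues attributes give the quantities discussed next (file name hypothetical).

```python
import pandas as pd
import statsmodels.formula.api as smf

gpa = pd.read_csv("gpa.csv")  # hypothetical file name

fit_all = smf.ols("GPA ~ HSM + HSS + HSE + SATM + SATCR + SATW", data=gpa).fit()
print(f"R-squared = {fit_all.rsquared:.3f}")  # about 27% for these data
print(fit_all.pvalues.round(4))               # P-values for the individual t tests
```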
Figure 11.10 Multiple regression output for model using all variables to predict GPA, Example 11.11.
Examination of the t statistics and the associated P-values for the individual regression coefficients reveals a surprising result. None of the variables are significant! At first, this result may appear to contradict the ANOVA results. How can the model explain more than 27% of the variation and have t tests suggesting that none of the variables make significant contributions?
Once again, it is important to understand that these t tests assess the contribution of each variable when it is added to a model that already has the other five explanatory variables. This result does not necessarily mean that the regression coefficients for the six explanatory variables are all 0. It simply means that the contribution of each variable overlaps considerably with the contributions of the other five variables already in the model.
When a model has a large number of insignificant variables, it is common to refine the model. This is often termed model selection. We prefer smaller models to larger models because they are easier to work with and understand. However, given the many complications that can arise in multiple regression, there is no universal “best” approach to refine a model. There is also no guarantee that there is just one acceptable refined model.
Many statistical software packages now provide the capability of summarizing the fits of all possible models that can be formed from a set of explanatory variables. Figure 11.11 contains Minitab output that shows the two best models in terms of highest R² for each number of explanatory variables considered.
Figure 11.11 Minitab output, summarizing the fit of different regression models based on several model selection methods.
The list also contains the regression standard error s. Finding a model (or models) that minimizes this quantity is another model selection approach. It is equivalent to choosing a model based on the largest adjusted R², a version of R² that is adjusted for the number of explanatory variables in the model.
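When the software does not offer an all-subsets summary, one can be built directly. The sketch below enumerates every subset of the six predictors and records R², adjusted R², and s, so that any of the selection criteria above can be applied (file name hypothetical).

```python
from itertools import combinations

import pandas as pd
import statsmodels.formula.api as smf

gpa = pd.read_csv("gpa.csv")  # hypothetical file name
predictors = ["HSM", "HSS", "HSE", "SATM", "SATCR", "SATW"]

rows = []
for k in range(1, len(predictors) + 1):
    for subset in combinations(predictors, k):
        fit = smf.ols("GPA ~ " + " + ".join(subset), data=gpa).fit()
        rows.append({"model": " + ".join(subset), "k": k,
                     "R2": fit.rsquared, "adjR2": fit.rsquared_adj,
                     "s": fit.mse_resid ** 0.5})

# Minimizing s ranks the models in the same order as maximizing adjusted R2.
summary = pd.DataFrame(rows).sort_values("adjR2", ascending=False)
print(summary.head(10).round(4))
```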
Many statistical software packages also provide the capability for testing whether a collection of regression coefficients in a multiple regression model are all 0. We use this approach to address two interesting questions about our GPA data set. We did not discuss such tests in the first section of this chapter, but the basic idea is quite simple and is discussed in Exercise 11.30 (page 591).
In the context of the multiple regression model with all six predictors, we ask first whether or not the coefficients for the three SAT scores are all 0. In other words, do the SAT scores add any significant predictive information to that already contained in the high school grades? To be fair, we also ask the complementary question: Do the high school grades add any significant predictive information to that already contained in the SAT scores?
The answers are given in the R output in Figure 11.12, which reports an F statistic, degrees of freedom, and P-value for each of the two tests: one for the three SAT coefficients and one for the three high school grade coefficients.
Figure 11.12 R output, summarizing two tests for different collections of regression coefficients.
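Each of these tests compares the full six-variable model with a reduced model that omits the collection of coefficients being tested. Here is a sketch of both comparisons in Python, using statsmodels' compare_f_test (file name hypothetical); the numerical results should match the R output in Figure 11.12 only with the actual data.

```python
import pandas as pd
import statsmodels.formula.api as smf

gpa = pd.read_csv("gpa.csv")  # hypothetical file name

full = smf.ols("GPA ~ HSM + HSS + HSE + SATM + SATCR + SATW", data=gpa).fit()

# Do the SAT scores add to the high school grades?
reduced_sat = smf.ols("GPA ~ HSM + HSS + HSE", data=gpa).fit()
f, p, q = full.compare_f_test(reduced_sat)
print(f"SAT scores: F = {f:.2f}, df = ({q:.0f}, {full.df_resid:.0f}), P = {p:.4f}")

# The complementary question: do the grades add to the SAT scores?
reduced_hs = smf.ols("GPA ~ SATM + SATCR + SATW", data=gpa).fit()
f, p, q = full.compare_f_test(reduced_hs)
print(f"HS grades:  F = {f:.2f}, df = ({q:.0f}, {full.df_resid:.0f}), P = {p:.4f}")
```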
11.21 Annual ranking of world universities.
Since 2004, Times Higher Education has provided an annual
ranking of the world universities.6
A total score for each university is calculated based on
weighting the scores for the following five categories: Teaching
(30%), Research (30%), Citations (30%), Industry Income (2.5%),
and International Outlook (7.5%). Assuming that we don’t know
these weights, let’s consider developing a model to predict
total score (Overall) based on a sample of 50 universities and
using only the first three category scores: Teaching, Research,
and Citations.
Using numerical and graphical summaries, describe the distribution of each explanatory variable.
Using numerical and graphical summaries, describe the relationship between each pair of explanatory variables.
11.22 Looking at the simple linear regressions.
Refer to the previous exercise. Now look at the relationship
between each explanatory variable and the total score.
Generate scatterplots for each explanatory variable and the total score. Do these relationships all look linear?
Compute the correlation between each explanatory variable and the total score. Are certain explanatory variables more strongly associated with the total score than others?
11.23 Multiple linear regression model.
Refer to the previous two exercises. Let’s now consider a linear
regression using all three explanatory variables.
Write out the statistical model for this analysis, making sure to specify all assumptions.
Run the multiple regression model and specify the fitted regression equation.
Obtain the residuals and check the conditions needed for inference.
Generate a 95% confidence interval for each coefficient. Explain what these intervals tell you.
What percent of the variation in total score is explained by this model? What is the estimate for the model standard deviation σ?
11.24 Explaining the results. Refer to the previous exercise. A friend, knowing that these three categories were all weighted the same by Times Higher Education, does not understand why your model fit seems to suggest different weights for these three scores. Explain to your friend why this can happen.
11.25 Refining the GPA model: Residual checks.
Figure 11.11
(page 574)
provides a list of the top models based on R². For each of the four models under consideration, obtain the residuals and use plots to check the conditions needed for inference. Comment on any concerns.
11.26 Considering a transformation. When we
regressed GPA versus the high school scores, the residuals were
skewed to the left (Figure 11.7,
page 583).
Refit the model but now use a transformed version of GPA as the response. Compare the residuals from this fit with those in Figure 11.7. Does the transformation reduce the skewness?
11.27 Refining the GPA model: Inference.
Refer to
Exercise 11.25. For each of the four models under consideration, report the
least-squares equation, estimated model standard deviation
s, and P-values for each of the individual
coefficients. Based on these results and the residuals checks of
the previous exercise, which model do you think is most
preferred? Explain your answer.
11.28 Predicting college debt: Multiple regression.
Refer to
Exercises 10.6
(page 536)
and
10.11 (page 537) for a description of the problem and data set. Let’s now
consider fitting a model to predict average debt (AveDebt) using
all four explanatory variables: Admit, Grad4Rate, InCost, and
InCostAid.
Write out the statistical model for this analysis, making sure to specify all assumptions.
Run the multiple regression model and specify the fitted regression equation.
Obtain the residuals from part (b) and check assumptions. Is Colorado School of Mines still an unusual case? Or can we proceed with inference using the entire data set? Provide a brief summary.
11.29 Predicting college debt: Inference.
Refer to the previous exercise. Let’s proceed using the entire
data set.
Report the F statistic, its degrees of freedom, and the P-value. What do you conclude based on this test result?
What percent of the variability in average debt is explained by this model?
Using this F test and the individual parameter t tests, write a one-paragraph summary of this model’s fit to the data.
11.30 Testing a collection of variables. Refer to the previous exercise. Although the F test was highly significant, only Admit is found to be significant using the individual parameter t tests. This raises the question whether the other three variables further contribute to the prediction of average debt, given that the admittance rate is in the model.
In this chapter, we discussed the F test for a collection of regression coefficients. In most cases, this capability is provided by the software. When it is not, the test can be performed using the statistic
\[
F = \frac{(\mathrm{SSE}_{\text{reduced}} - \mathrm{SSE}_{\text{full}})/q}{\mathrm{SSE}_{\text{full}}/(n - p - 1)}
\]
with q and n − p − 1 degrees of freedom, where q is the number of coefficients set to 0 under the null hypothesis, SSE_reduced and SSE_full are the error sums of squares of the models without and with those q variables, and n − p − 1 is the error degrees of freedom of the full model. Use this approach to determine whether Grad4Rate, InCost, and InCostAid contribute significantly to the prediction of average debt, given that Admit is in the model.
11.31 A mechanistic explanation of popularity. In Exercise 10.61 (page 561), correlations between an adolescent’s “popularity,” expression of a serotonin receptor gene, and rule-breaking behaviors were assessed. An additional portion of the analysis looked at the relationship between the gene expression level and popularity, after adjusting for rule-breaking (RB) behaviors. This adjustment was necessary because RB is positively associated both with this gene expression and with popularity in adolescents. The following table summarizes these regression analyses using the composite (questionnaire and video) RB score. A total of 202 individuals were included in this analysis.
| | b | s(b) |
|---|---|---|
| Model 1 | | |
| Gene expression | 0.204 | 0.066 |
| Model 2 | | |
| Gene expression | 0.161 | 0.066 |
| RB.composite | 0.100 | 0.030 |
For all analyses, use the 0.05 significance level.
What are the error degrees of freedom for Model 1 and for Model 2?
Test the null hypothesis that the serotonin gene receptor coefficient is equal to 0 in Model 1. State the test statistic and P-value.
Perform both individual-variable t tests for Model 2. Again state the test statistics and P-values.
Is there still a positive relationship between the serotonin gene receptor expression level and popularity, after adjusting for RB? If yes, compare the increase in popularity for a unit increase in gene expression (while RB remains unchanged) in the two models.
11.32 Consider the sex of the students. Refer
to
Example 11.11
(page 586).
The seventh explanatory variable provided in the GPA data set is
a sex indicator variable. This variable (Sex) takes the value 0
for males and 1 for females. If we include it in our model
involving the other six variables, it allows the intercept to
differ for the two sexes (see
Exercises 11.16
and
11.17,
pages 574
and 575).
Include the variable Sex with the other six explanatory variables and refit the model. Compare the fit of this model, using R² and the regression standard error s, with the fit of the model that excludes Sex (Figure 11.10).
Does this indicator variable appear to contribute to our explanation of GPA? Report the test results.
Does the coefficient suggest that males or females have higher GPA scores? Explain your answer.
11.33 Predicting energy-drink consumption. Energy-drink advertising consistently emphasizes a physically active lifestyle and often features extreme sports and risk taking. Are these typical characteristics of an energy-drink consumer? A researcher decided to examine the links between energy-drink consumption, sport-related (jock) identity, and risk taking.7 She invited more than 1500 undergraduate students enrolled in large introductory-level courses at a public university to participate. Each participant had to complete a 45-minute anonymous questionnaire. From this questionnaire, jock identity and risk-taking scores were obtained, where the higher the score, the stronger the trait. She ended up with 795 respondents. The following table summarizes the results of a multiple regression analysis using the frequency of energy-drink consumption in the past 30 days as the response variable:
| Explanatory variable | b |
|---|---|
| Age | |
| Sex | |
| Race | |
| Ethnicity | 0.10** |
| Parental education | 0.02 |
| College GPA | |
| Jock identity | 0.05 |
| Risk taking | 0.19*** |
The superscript ** means that the individual coefficient t test had a P-value less than 0.01, and the superscript *** means that the test had a P-value less than 0.001. All other P-values were greater than 0.05.
The overall F statistic is reported to be 8.11. What are the degrees of freedom associated with this statistic?
R is reported to be 0.28. What percent of the variation in energy-drink consumption is explained by the model? Is this a highly predictive model? Explain.
Interpret each of the regression coefficients that are significant.
The researcher states, “Controlling for gender, age, race, ethnicity, parental educational achievement, and college GPA, each of the predictors (risk taking and jock identity) was positively associated with energy-drink consumption frequency.” Explain what is meant by “controlling for” these variables and how this helps strengthen her assertion that jock identity and risk taking are positively associated with energy-drink consumption.
11.34 Is the number of tornadoes increasing? In
Exercise 10.15, data on the number of tornadoes in the United States between
1953 and 2019 were analyzed to see if there was a linear trend
over time. Some argue that it is not the number of tornadoes that is increasing over time but rather the probability of sighting them, because there are more people living in the United States. Let's investigate this by including the U.S. census
count as an additional explanatory variable.
Using numerical and graphical summaries, describe the relationship between each pair of variables.
Perform a multiple regression using both year and census count as explanatory variables. Write down the fitted model.
Obtain the residuals from part (b). Plot them versus the two explanatory variables and generate a Normal quantile plot. What do you conclude?
Test the hypothesis that there is a linear increase over time. State the null and alternative hypotheses, test statistic, and P-value. What is your conclusion?