The simple linear regression model assumes that the mean of the response variable y depends on the explanatory variable x according to a linear equation

$$\mu_y = \beta_0 + \beta_1 x$$

For any fixed value of x, the response y varies Normally about this mean and has a standard deviation $\sigma$ that is the same for all values of x.
In the multiple regression setting, the response variable y depends on p explanatory variables, which we will denote by $x_1, x_2, \ldots, x_p$. The mean response is a linear function of these explanatory variables:

$$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

As in simple linear regression, this expression is the population regression equation, and the observed values of y vary about their means given by this equation.
Just as we did in simple linear regression, we can also think of this
model in terms of subpopulations of responses. Here, each
subpopulation corresponds to a particular set of values for
all the explanatory variables $x_1, x_2, \ldots, x_p$.
Our case study is based on data collected on science majors at a large university.1 The goal of the study was to develop a model to predict success in the early university years. Success was measured using the cumulative grade point average (GPA) after three semesters. The explanatory variables were achievement scores available at the time of enrollment in the university. These included their average high school grades in mathematics (HSM), science (HSS), and English (HSE).
We will use high school grades to predict the response variable GPA.
There are p = 3 explanatory variables: HSM, HSS, and HSE.
The population multiple regression equation for the mean GPAs is

$$\mu_{\text{GPA}} = \beta_0 + \beta_1 \text{HSM} + \beta_2 \text{HSS} + \beta_3 \text{HSE}$$

For the straight-C subpopulation of students, the equation gives the mean as $\beta_0 + \beta_1 C + \beta_2 C + \beta_3 C$, where C stands for the numerical value used to code a grade of C.
The data for a simple linear regression problem consist of
n pairs of a response variable y and an explanatory
variable x. Because there are several explanatory variables in
multiple regression, the notation needed to describe the data is more
elaborate. Each observation, or case, consists of a value for the
response variable and for each of the explanatory variables. Call $x_{ij}$ the value of the jth explanatory variable for the ith case. The data for the ith case are then

$$(x_{i1}, x_{i2}, \ldots, x_{ip}, y_i)$$

Here, n is the number of cases, and p is the number of explanatory variables. Data are often entered into computer regression programs in this format. Each row is a case, and each column corresponds to a different variable.
The data for
Example 11.1, with several additional explanatory variables, appear in this
format in the GPA data file.
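To make the case-per-row layout concrete, here is a minimal pandas sketch. The file name gpa.csv is an assumption for illustration, not the textbook's actual file:

```python
import pandas as pd

# Hypothetical file in the case-per-row format described above:
# one row per student, one column per variable (GPA first).
df = pd.read_csv("gpa.csv")

print(df.head(6))    # the first six cases, as in Figure 11.1
n = len(df)          # n, the number of cases
p = df.shape[1] - 1  # p explanatory variables: every column except GPA
```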
Figure 11.1
shows the first six rows entered into an Excel spreadsheet. Grade
point average (GPA) is the response variable, followed by the explanatory variables: six achievement scores and sex. The six achievement scores are all quantitative explanatory variables. Sex is an indicator variable using the numerical values 0 and 1 to represent male and female, respectively.
Figure 11.1 Format of the data set for predicting success in college, Example 11.2.
Indicator variables are used frequently in multiple regression to represent the levels, or groups, of a categorical explanatory variable. See Exercises 11.16 through 11.18 (pages 574–575) for more discussion of their use in a multiple regression model.
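For instance, a hedged pandas sketch (the column name and values are assumptions) shows how a two-level categorical variable can be converted to a 0/1 indicator before fitting:

```python
import pandas as pd

# Assumed two-level categorical variable; the actual data file may differ.
df = pd.DataFrame({"sex": ["M", "F", "F", "M"]})

# Code male as 0 and female as 1, matching the coding described above.
df["sex01"] = (df["sex"] == "F").astype(int)
print(df)
```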
11.1 Describing a multiple regression. To minimize the negative impact of math anxiety on achievement in a research design course, a group of investigators implemented a series of feedback sessions, in which the teacher went over the small-group assignments and discussed the most frequently committed errors.2 At the end of the course, data from 166 students were obtained. The investigators studied how students’ final exam scores were explained by math course anxiety, math test anxiety, numerical task anxiety, enjoyment, self-confidence, motivation, and perceived usefulness of the feedback sessions.
What is the response variable?
What are the cases, and what is n, the number of cases?
What is p, the number of explanatory variables?
What are the explanatory variables?
Similar to simple linear regression, we combine the population regression equation and the assumptions about how the observed y vary about their means to construct the multiple linear regression model. The subpopulation means describe the FIT part of the conceptual model DATA = FIT + RESIDUAL.
The RESIDUAL part represents the variation of observed y about their means.
We will use the same notation for the residual that we used in the simple linear regression model. The symbol $\epsilon$ represents the deviation of an observed value of y from its subpopulation mean. These deviations are assumed to be independent and Normally distributed with mean 0 and standard deviation $\sigma$.
The assumption that the subpopulation means are related to the regression coefficients $\beta_0, \beta_1, \ldots, \beta_p$ by the equation

$$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_p x_p$$

implies that we can estimate all subpopulation means from the estimates of the $p + 1$ coefficients $\beta_0, \beta_1, \ldots, \beta_p$.
We also need to be cautious when interpreting each of the regression coefficients in a multiple regression. First, the intercept $\beta_0$ is the mean response when every explanatory variable is set to 0, a combination of values that often lies outside the range of the data and may be of no practical interest.
Second, the description provided by the regression coefficient of
each x variable is similar to that provided by the slope in
simple linear regression but only in a specific situation—namely,
when all other x variables are held constant. We need this extra condition because with multiple
x variables, it is quite possible that a unit change in one
x variable may be associated with changes in other
x variables. If that occurs, then the overall change in the
mean of y is not described by just a single regression
coefficient.
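A short simulation can make this concrete. The NumPy sketch below uses made-up coefficients and correlated explanatory variables (all names and numbers are my own, not from the text); the coefficient of $x_1$ changes noticeably between the fit of y on $x_1$ alone and the fit of y on both variables:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Correlated explanatory variables: x2 depends on x1.
x1 = rng.normal(size=n)
x2 = 0.8 * x1 + rng.normal(scale=0.6, size=n)

# Made-up population model: y = 1 + 2*x1 + 3*x2 + noise.
y = 1 + 2 * x1 + 3 * x2 + rng.normal(size=n)

# Simple linear regression of y on x1 alone.
X_simple = np.column_stack([np.ones(n), x1])
b_simple, *_ = np.linalg.lstsq(X_simple, y, rcond=None)

# Multiple regression of y on x1 and x2 together.
X_mult = np.column_stack([np.ones(n), x1, x2])
b_mult, *_ = np.linalg.lstsq(X_mult, y, rcond=None)

print("coefficient of x1, y on x1 alone:  ", b_simple[1])  # near 2 + 3(0.8) = 4.4
print("coefficient of x1, y on x1 and x2:", b_mult[1])     # near 2
```

The simple-regression slope absorbs part of the effect of $x_2$ because a unit change in $x_1$ is associated with a change in $x_2$; holding $x_2$ fixed in the multiple regression recovers the coefficient of $x_1$ alone.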
11.2 Understanding the fitted regression line. The fitted regression equation for a multiple regression is
If
For the answer to part (a) to be valid, is it necessary that
the values
If you hold
Similar to simple linear regression, we use the method of least squares to obtain estimators of the regression coefficients $\beta_0, \beta_1, \ldots, \beta_p$. Let $b_0, b_1, \ldots, b_p$ denote the estimators of the parameters $\beta_0, \beta_1, \ldots, \beta_p$.
For the ith observation, the predicted response is

$$\hat{y}_i = b_0 + b_1 x_{i1} + b_2 x_{i2} + \cdots + b_p x_{ip}$$

The ith residual, the difference between the observed and the predicted response, is, therefore,

$$e_i = y_i - \hat{y}_i = y_i - b_0 - b_1 x_{i1} - b_2 x_{i2} - \cdots - b_p x_{ip}$$
The method of least squares chooses the values of the b's that make the sum of the squared residuals as small as possible. In other words, the parameter estimates $b_0, b_1, \ldots, b_p$ minimize the quantity

$$\sum_{i=1}^{n} (y_i - b_0 - b_1 x_{i1} - \cdots - b_p x_{ip})^2$$
The formula for the least-squares estimates in multiple regression
is complicated. We will be content to understand the principle on which it is based
and to let software do the computations.
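As an illustration of what the software is doing, here is a minimal NumPy sketch (the data are made up) that obtains the least-squares estimates for a design matrix with an intercept column:

```python
import numpy as np

# Made-up data: n cases, p explanatory variables in the columns of X.
rng = np.random.default_rng(0)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = 3 + X @ np.array([1.0, -2.0, 0.5]) + rng.normal(size=n)

# Design matrix with a leading column of ones, so b[0] is the intercept b0.
X1 = np.column_stack([np.ones(n), X])

# Least-squares estimates b0, b1, ..., bp minimize the sum of squared residuals.
b, *_ = np.linalg.lstsq(X1, y, rcond=None)

yhat = X1 @ b   # predicted responses
e = y - yhat    # residuals e_i = y_i - yhat_i
print(b)                 # approximately [3, 1, -2, 0.5]
print((e ** 2).sum())    # the minimized sum of squared residuals
```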
The parameter $\sigma^2$ measures the variability of the responses about the population regression equation. As in simple linear regression, we estimate it from the residuals:

$$s^2 = \frac{\sum e_i^2}{n - p - 1}$$

The quantity $n - p - 1$ is the degrees of freedom associated with $s^2$; it equals the sample size n minus $p + 1$, the number of $\beta$'s estimated in fitting the model.
We can obtain confidence intervals and perform significance tests for each of the regression coefficients $\beta_j$ just as we did in simple linear regression. The standard errors of the $b_j$ have more complicated formulas than in the simple case, but all are multiples of the estimate s, and we rely on software to compute them.
Be very careful in your interpretation of the t tests and confidence intervals for individual regression coefficients. In simple linear regression, the model says that the mean response depends on a single x, so the hypothesis $H_0\colon \beta_1 = 0$ states that x has no linear association with y. In multiple regression, the hypothesis $H_0\colon \beta_j = 0$ says only that $x_j$ is not needed to predict y when all the other explanatory variables remain in the model.
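For readers who want to see what is behind the software output, here is a NumPy/SciPy sketch (function and variable names are my own, not from the text) that computes standard errors, t statistics, and level-C confidence intervals from the design matrix:

```python
import numpy as np
from scipy import stats

def coef_inference(X1, y, level=0.95):
    """Least-squares coefficients with SEs, t statistics, and CIs.
    X1 is the n x (p+1) design matrix whose first column is all ones."""
    n, k = X1.shape                      # k = p + 1 parameters
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    e = y - X1 @ b                       # residuals
    df = n - k                           # n - p - 1 degrees of freedom
    s2 = (e ** 2).sum() / df             # estimate of sigma^2
    se = np.sqrt(s2 * np.diag(np.linalg.inv(X1.T @ X1)))
    t = b / se                           # t statistics for H0: beta_j = 0
    pvals = 2 * stats.t.sf(np.abs(t), df)
    tstar = stats.t.ppf(0.5 + level / 2, df)
    ci = np.column_stack([b - tstar * se, b + tstar * se])
    return b, se, t, pvals, ci
```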
Because regression is often used for prediction, we may wish to use multiple regression models to construct confidence intervals for a mean response and prediction intervals for a future observation. The basic ideas are the same as in the simple linear regression case.
In most software systems, the same commands that give confidence and prediction intervals for simple linear regression work for multiple regression. The only difference is that we specify a list of explanatory variables rather than a single variable. Software allows us to perform these rather complex calculations without an intimate knowledge of all the computational details. This frees us to concentrate on the meaning and appropriate use of the results.
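As one concrete illustration, a sketch using the statsmodels package (the data arrays here are placeholders) requests both kinds of intervals with the same call:

```python
import numpy as np
import statsmodels.api as sm

# Placeholder data: replace with the response y and explanatory variables X.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
y = X @ np.array([1.0, 2.0, 0.5]) + rng.normal(size=50)

fit = sm.OLS(y, sm.add_constant(X)).fit()

# New settings of the explanatory variables at which to predict.
x_new = sm.add_constant(rng.normal(size=(2, 3)), has_constant="add")
frame = fit.get_prediction(x_new).summary_frame(alpha=0.05)
print(frame[["mean", "mean_ci_lower", "mean_ci_upper",   # CI for mean response
             "obs_ci_lower", "obs_ci_upper"]])            # prediction interval
```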
In simple linear regression, the F test from the ANOVA table is equivalent to the two-sided t test of the hypothesis that the slope of the regression line is 0. For multiple regression, there is a corresponding ANOVA F test, but it tests the hypothesis that all the regression coefficients (with the exception of the intercept) are 0. Here is the general form of the ANOVA table for multiple regression:
Source | Degrees of freedom | Sum of squares | Mean square | F |
---|---|---|---|---|
Model | $p$ | $\text{SSM} = \sum (\hat{y}_i - \bar{y})^2$ | SSM/DFM | MSM/MSE |
Error | $n - p - 1$ | $\text{SSE} = \sum (y_i - \hat{y}_i)^2$ | SSE/DFE | |
Total | $n - 1$ | $\text{SST} = \sum (y_i - \bar{y})^2$ | SST/DFT | |
The ANOVA table is similar to that for simple linear regression. The degrees of freedom for the model increase from 1 to p to reflect the fact that we now have p explanatory variables rather than just one. As a consequence, the degrees of freedom for error decrease by the same amount. It is always a good idea to calculate the degrees of freedom by hand and then check that your software agrees with your calculations. This ensures that you have not made some serious error in specifying the model, specifying the variable types, or entering the data.
The sums of squares represent sources of variation. Once again, both the sums of squares and their degrees of freedom add:

$$\text{SST} = \text{SSM} + \text{SSE} \qquad \text{DFT} = \text{DFM} + \text{DFE}$$
The estimate of the variance $\sigma^2$ for our model is again provided by the mean square error in the ANOVA table; that is, $s^2 = \text{MSE}$.
The ratio MSM/MSE is the F statistic for testing the null hypothesis

$$H_0\colon \beta_1 = \beta_2 = \cdots = \beta_p = 0$$

against the alternative hypothesis

$$H_a\colon \text{at least one of the } \beta_j \text{ is not } 0$$
The null hypothesis says that none of the explanatory variables are predictors of the response variable when used in the form expressed by the multiple regression equation. The alternative states that at least one of them is a predictor of the response variable.
As in simple linear regression, large values of F give evidence against $H_0$. When $H_0$ is true, F has the $F(p,\, n - p - 1)$ distribution.
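A hedged NumPy/SciPy sketch (names are my own) that assembles these ANOVA quantities and carries out the F test; it also computes the ratio SSM/SST, which is discussed next:

```python
import numpy as np
from scipy import stats

def anova_f_test(X1, y):
    """ANOVA quantities for multiple regression.
    X1 is the n x (p+1) design matrix with a leading column of ones."""
    n, k = X1.shape
    p = k - 1
    b, *_ = np.linalg.lstsq(X1, y, rcond=None)
    yhat = X1 @ b
    ssm = ((yhat - y.mean()) ** 2).sum()   # model sum of squares
    sse = ((y - yhat) ** 2).sum()          # error sum of squares
    sst = ((y - y.mean()) ** 2).sum()      # total: sst == ssm + sse
    msm, mse = ssm / p, sse / (n - p - 1)
    f = msm / mse
    pval = stats.f.sf(f, p, n - p - 1)     # upper tail of F(p, n - p - 1)
    r2 = ssm / sst                         # squared multiple correlation
    return f, pval, r2
```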
A common error in the use of multiple regression is to assume that
all the regression coefficients are statistically different from
zero whenever the F statistic has a small P-value. Be sure that you understand the difference between the
F test and the t tests for individual coefficients in
the multiple regression setting. The F test provides an overall
assessment of the model to explain the response. The individual
t tests assess the importance of a single variable, given the
presence of the other variables in the model. While looking at the set
of individual t tests to assess overall model significance may
be tempting, it is not recommended because it leads to more frequent
incorrect conclusions. The F test also better handles
situations when there are two or more highly correlated explanatory
variables.
For simple linear regression, we noted that the square of the sample correlation could be written as the ratio of SSM to SST and could be interpreted as the proportion of variation in y explained by x. The ratio of SSM to SST is routinely calculated for multiple regression and still can be interpreted as the proportion of explained variation. The difference is that it relates to the collection of explanatory variables in the model.
We use a capital R here to reinforce the fact that this statistic depends on the whole collection of explanatory variables. The statistic is the squared multiple correlation,

$$R^2 = \frac{\text{SSM}}{\text{SST}}$$

Often, $R^2$ is multiplied by 100 and expressed as a percent.
Data for multiple linear regression consist of the values
of a response variable y and p explanatory
variables $x_1, x_2, \ldots, x_p$ for each of n cases, recorded as follows:
Individual | $y$ | $x_1$ | $x_2$ | $\cdots$ | $x_p$ |
---|---|---|---|---|---|
1 | $y_1$ | $x_{11}$ | $x_{12}$ | $\cdots$ | $x_{1p}$ |
2 | $y_2$ | $x_{21}$ | $x_{22}$ | $\cdots$ | $x_{2p}$ |
$\vdots$ | $\vdots$ | $\vdots$ | $\vdots$ | | $\vdots$ |
n | $y_n$ | $x_{n1}$ | $x_{n2}$ | $\cdots$ | $x_{np}$ |
The multiple linear regression model with response variable y and p explanatory variables $x_1, x_2, \ldots, x_p$ is

$$y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \cdots + \beta_p x_{ip} + \epsilon_i$$

for $i = 1, 2, \ldots, n$, where the $\epsilon_i$ are independent and Normally distributed with mean 0 and standard deviation $\sigma$. The parameters of the model are $\beta_0, \beta_1, \ldots, \beta_p$, and $\sigma$.
The multiple regression equation predicts the response variable by a linear relationship with all the explanatory variables:

$$\hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_p x_p$$

The $\beta$'s are estimated by $b_0, b_1, \ldots, b_p$, which are obtained by the method of least squares. The parameter $\sigma$ is estimated by

$$s = \sqrt{\text{MSE}} = \sqrt{\frac{\sum e_i^2}{n - p - 1}}$$

where the $e_i = y_i - \hat{y}_i$ are the residuals.
A level C confidence interval for $\beta_j$ is

$$b_j \pm t^* \, \text{SE}_{b_j}$$

where $t^*$ is the value for the $t(n - p - 1)$ density curve with area C between $-t^*$ and $t^*$.
The test of the hypothesis $H_0\colon \beta_j = 0$ is based on the t statistic

$$t = \frac{b_j}{\text{SE}_{b_j}}$$

and the $t(n - p - 1)$ distribution.
The ANOVA table for a multiple linear regression gives the degrees of freedom, sums of squares, and mean squares for the model, error, and total sources of variation.
The ANOVA F statistic is the ratio MSM/MSE from the ANOVA table and is used to test the null hypothesis

$$H_0\colon \beta_1 = \beta_2 = \cdots = \beta_p = 0$$

If $H_0$ is true, this statistic has an $F(p,\, n - p - 1)$ distribution.
The squared multiple correlation is given by the expression

$$R^2 = \frac{\text{SSM}}{\text{SST}}$$

and is interpreted as the proportion of the variability in the response variable y that is explained by the explanatory variables $x_1, x_2, \ldots, x_p$ in a multiple linear regression.
11.1 What’s wrong? In each of the following situations, explain what is wrong and why.
A small P-value for the ANOVA F test implies that all explanatory variables are significantly different from zero.
In a multiple regression with a sample size of 45 and six
explanatory variables, the test statistic for the null
hypothesis
11.2 What’s wrong? In each of the following situations, explain what is wrong and why.
One of the assumptions for multiple regression is that the distribution of each explanatory variable is Normal.
The null hypothesis
The multiple correlation coefficient gives the average correlation between the response variable and each explanatory variable in the model.
11.3 Describe the regression model. Is the adult life expectancy of a dog breed related to its level of inbreeding? To investigate this, researchers collected information on 168 breeds and fit a model using each breed’s autosomal inbreeding coefficient and the common logarithm of the adult male average weight as explanatory variables.3
What is the response variable?
What is n, the number of cases?
What is p, the number of explanatory variables?
What are the explanatory variables?
11.4 Health behavior versus mindfulness among
undergraduates.
Researchers surveyed 357 undergraduates throughout the United
States and quantified each student’s health behavior,
mindfulness, subjective sleep quality (SSQ), and perceived
stress level.4
Of interest was whether the relationship of health behavior
versus mindfulness varied by SSQ, after adjusting for perceived
stress. To address this, the researchers included the
interaction term mindfulness × SSQ in their regression model.
What is the response variable?
What is n, the number of cases?
What is p, the number of explanatory variables?
What are the explanatory variables?
11.5 Predicting life expectancy. Refer to Exercise 11.3. The fitted linear model is
where
What does this equation tell you about the relationship between life expectancy and the weight and inbreeding level of a dog? Explain your answer.
What is the predicted life span for a Pug, which has
The life expectancy of a Pug is 7.48 years. Compute the residual.
11.6 The effect of inbreeding. Refer to the previous exercise. Say that you hold the average adult weight fixed.
What is the effect of the inbreeding coefficient increasing by 0.1, 0.25, and 0.5 on life expectancy?
Given the results in part (a), do you think it is important to test whether the coefficient of the inbreeding term is significantly different from zero? Explain your answer.
11.7 95% confidence intervals for regression coefficients.
In each of the following settings, give a 95% confidence
interval for the coefficient of
11.8 Significance tests for regression coefficients.
For each of the settings in the previous exercise, test the null
hypothesis that the coefficient of
11.9 Constructing the ANOVA table. Six explanatory variables are used to predict a response variable using multiple regression. There are 183 observations.
Write the statistical model that is the foundation for this analysis. Also include a description of all assumptions.
Outline the ANOVA table, giving the sources of variation and numerical values for the degrees of freedom.
11.10 More on constructing the ANOVA table. A
multiple regression analysis of 57 cases was performed with four
explanatory variables. Suppose that
Find the value of the F statistic for testing the null hypothesis that the coefficients of all the explanatory variables are zero.
What are the degrees of freedom for this statistic?
Find bounds on the P-value using Table E. Show your work.
What proportion of the variation in the response variable is explained by the explanatory variables?
11.11 Significance tests for regression coefficients. Refer to Check-in question 11.1 (page 566). The following table contains the estimated coefficients and standard errors of their multiple regression fit. Each explanatory variable is an average of several five-point Likert scale questions.
Variable | Estimate | SE |
---|---|---|
Intercept | 1.316 | 0.651 |
Math course anxiety | | 0.114 |
Math test anxiety | | 0.119 |
Numerical task anxiety | | 0.116 |
Enjoyment | 0.176 | 0.114 |
Self-confidence | 0.118 | 0.114 |
Motivation | 0.097 | 0.115 |
Feedback usefulness | 0.644 | 0.194 |
Look at the signs of the coefficients (positive and negative). Is this what you would expect in this setting? Explain your answer.
What are the degrees of freedom for the model and error?
Test the significance of each coefficient and state your conclusions.
11.12 Compare the variability. In many multiple
regression summaries, researchers report both
11.13 ANOVA table for multiple regression. Use the following information and the general form of the ANOVA table for multiple regression on page 543 to perform the ANOVA F test and compute $R^2$.
Source | Degrees of freedom | Sum of squares | Mean square | F |
---|---|---|---|---|
Model | 3 | 90 | ||
Error | ||||
Total | 43 | 510 |
11.14 Another ANOVA table for multiple regression. Use the following information and the general form of the ANOVA table for multiple regression on page 543 to perform the ANOVA F test and compute $R^2$.
Source | Degrees of freedom | Sum of squares | Mean square | F |
---|---|---|---|---|
Model | 4 | 17.5 | ||
Error | ||||
Total | 33 | 524 |
11.15 Polynomial models. Multiple regression can be used to fit a polynomial curve of degree q,

$$\mu_y = \beta_0 + \beta_1 x + \beta_2 x^2 + \cdots + \beta_q x^q$$

by treating the powers of x as the explanatory variables; that is, $x_1 = x$, $x_2 = x^2$, and so on up to $x_q = x^q$.
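As a hedged illustration of this idea (the data and coefficients below are made up), the same least-squares machinery fits a polynomial once the powers of x are placed in the columns of the design matrix:

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.uniform(-2, 2, size=80)
y = 1 - 0.5 * x + 0.3 * x**3 + rng.normal(scale=0.2, size=80)  # made-up cubic

q = 3
# Columns are 1, x, x^2, ..., x^q, so b[j] estimates beta_j.
X = np.vander(x, q + 1, increasing=True)
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print(b)  # approximately [1, -0.5, 0, 0.3]
```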
11.16 Models with indicator variables. Suppose that x is an indicator variable with the value 0 for Group A and 1 for Group B. The following equations describe relationships between the mean response $\mu_y$ and x.
11.17 Differences in means. Verify that the coefficient of x in each part of the previous exercise is equal to the mean for Group B minus the mean for Group A. Do you think that this will be true in general? Explain your answer.
11.18 Comparing linear models. When the effect of one explanatory variable depends upon the value of another explanatory variable, we say the explanatory variables interact with each other. In a regression model, interaction can be included using the product of the two explanatory variables as an additional explanatory variable. Suppose that $x_1$ is a quantitative explanatory variable, $x_2$ is an indicator variable with the value 0 for Group A and 1 for Group B, and the model

$$\mu_y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_1 x_2$$

describes two linear relationships between $\mu_y$ and $x_1$, one for each group.
Substitute the value 0 for $x_2$ and give the intercept and slope of the resulting Group A regression line.
Substitute the value 1 for $x_2$ and give the intercept and slope of the resulting Group B regression line.
Which coefficient in the model is equal to the Group B intercept minus the Group A intercept?
Which coefficient in the model is equal to the Group B slope minus the Group A slope?
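The sketch below (with made-up coefficient values of my own) shows the substitution numerically: evaluating the model at $x_2 = 0$ and $x_2 = 1$ traces out two different straight lines in $x_1$:

```python
# Interaction model mu_y = b0 + b1*x1 + b2*x2 + b3*x1*x2 with made-up coefficients.
b0, b1, b2, b3 = 2.0, 1.5, 4.0, -0.5

def mean_response(x1: float, x2: int) -> float:
    """Mean response under the interaction model; x2 is the 0/1 group indicator."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

# Evaluating at x2 = 0 and x2 = 1 gives a separate line for each group.
for x1 in (0.0, 1.0, 2.0):
    print(f"x1={x1}: Group A -> {mean_response(x1, 0)}, Group B -> {mean_response(x1, 1)}")
```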
11.19 Game-day spending.
Game-day spending (ticket sales and food and beverage purchases)
is critical for the sustainability of many professional sports
teams. In the National Hockey League (NHL), nearly half the
franchises generate more than two-thirds of their annual income
from game-day spending. Understanding and possibly predicting
this spending would allow teams to respond with appropriate
marketing and pricing strategies. To investigate this
possibility, a group of researchers looked at data from one NHL
team over a three-season period. The following table summarizes the estimated regression coefficients (b) and their t statistics:
Explanatory variables | b | t |
---|---|---|
Constant | 12,493.47 | 12.13 |
Division | | |
Nonconference | | |
November | | |
December | | |
January | | |
February | | |
March | | |
Weekend | 2992.75 | 8.48 |
Night | 1460.31 | 2.13 |
Promotion | 2162.45 | 5.65 |
Season 2 | | |
Season 3 | | |
The overall F statistic was 11.59. What are the degrees of freedom and P-value of this statistic?
Does the ANOVA F test result imply that all the predictor variables are significant? Explain your answer.
Use t tests to see which of the explanatory variables significantly aid prediction in the presence of all the explanatory variables. Show your work.
The value of
The constant predicts the number of tickets sold for a nondivisional, conference game with no promotions played during the day on a weekday in October of Season 1. What is the predicted number of tickets sold for a divisional conference game with no promotions played on a weekend evening in March during Season 3?
Would a 95% confidence interval for the mean response or a 95% prediction interval be more appropriate to include with your answer to part (e)? Explain your reasoning.
11.20 Discrimination at work? A survey of 457
engineers in Canada was performed to identify the relationship
of race, language proficiency, and location of training in
finding work in the engineering field. In addition, each
participant completed the Workplace Prejudice and Discrimination
Inventory (WPDI), which is designed to measure perceptions of
prejudice on the job, primarily due to race or ethnicity. The
score of the WPDI ranged from 16 to 112, with higher scores
indicating more perceived discrimination. The following table
summarizes two multiple regression models (in terms of
coefficient estimates, their standard errors, and $R^2$):
Explanatory variables | Model 1 b | Model 1 s(b) | Model 2 b | Model 2 s(b) |
---|---|---|---|---|
Foreign trained | 0.55 | 0.21 | 0.58 | 0.22 |
Chinese | 0.06 | 0.24 | | |
South Asian | | 0.19 | | |
Black | | 0.52 | | |
Other Asian | | 0.34 | | |
Latin American | 0.20 | 0.46 | | |
Arab | 0.56 | 0.44 | | |
Other (not white) | 0.05 | 0.38 | | |
Mechanical | | 0.25 | | 0.25 |
Other (not electrical) | | 0.20 | | 0.21 |
Masters/PhD | 0.32 | 0.18 | 0.37 | 0.18 |
30–39 years old | | 0.22 | | 0.22 |
40 or older | 0.32 | 0.25 | 0.25 | 0.26 |
Female | | 0.19 | | 0.19 |
$R^2$ | 0.10 | | 0.11 | |
The F statistics for these two models are 7.12 and 3.90, respectively. What are the degrees of freedom and P-value of each statistic?
The F statistics for the multiple regressions are highly significant, but the $R^2$ values are quite small. Explain how this can happen.
Do foreign-trained engineers perceive more discrimination than do locally trained engineers? To address this, test whether the first coefficient in each model is equal to zero versus the greater-than alternative. Summarize your results.