11.35 Checking for a polynomial relationship.
When looking at the residuals from the simple linear model of BMI versus physical activity (PA), Figure 10.5 (page 524) suggested a possible curvilinear relationship. Let’s investigate fitting a quadratic
It is often best to subtract the sample mean
The regression model that included only PA had
Obtain the residuals from part (a) and check the multiple regression assumptions. Are there any remaining patterns in the data? Are the residuals approximately Normal? Explain.
Test the hypothesis that the coefficient of the variable
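One reason for subtracting the sample mean before squaring is that a predictor and its square are usually highly correlated. The Python sketch below illustrates this under stated assumptions: the PA values are made up (they are not from the textbook data file), and `correlation` is a helper written only for this illustration.

```python
import math

def correlation(u, v):
    """Pearson correlation of two equal-length sequences."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / math.sqrt(suu * svv)

# Hypothetical PA values (not from the textbook data file).
pa = [3.2, 5.1, 6.4, 7.0, 8.3, 9.9, 10.5, 12.1]
pa_mean = sum(pa) / len(pa)

raw_sq = [x ** 2 for x in pa]             # PA squared, no centering
centered = [x - pa_mean for x in pa]      # PA minus its sample mean
centered_sq = [c ** 2 for c in centered]  # squared centered PA

r_raw = correlation(pa, raw_sq)
r_centered = correlation(pa, centered_sq)
print(f"corr(PA, PA^2)          = {r_raw:.3f}")
print(f"corr(PA, (PA - mean)^2) = {r_centered:.3f}")
```

With these values the uncentered square is almost perfectly correlated with PA, while the centered square is only weakly correlated with it, which tends to make the individual coefficients easier to estimate and interpret.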
11.36 Architectural firm billings. A summary of
firms engaged in commercial architecture in the Indianapolis,
Indiana, area provides firm characteristics, including total
annual billing in the current year, total annual billing in the
previous year, the number of architects, the number of engineers,
and the number of staff employed in the firm.8
Consider developing a model to predict current total billing using
the other four variables.
Using numerical and graphical summaries, describe the distribution of current and past year total billing and the number of architects, engineers, and staff.
For each of the 10 pairs of variables, use graphical and numerical summaries to describe the relationship.
Carry out a multiple regression. Report the fitted regression equation and the value of the regression standard error s.
Analyze the residuals from the multiple regression. Are there any concerns?
A firm did not report its current total billing but had $1 million in billing last year and employs 3 architects, 1 engineer, and 17 staff members. What is the predicted total billing for this firm?
This analysis utilized the data from all commercial firms in the Indianapolis area that responded to the survey. Provide justification for the use of inference in this setting.
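The fitting and prediction steps in this exercise can be sketched in plain Python. This is a minimal illustration, not the textbook's procedure: the firm data are made up, the responses are generated from a known equation so that the fit can be checked, and `solve` and `fit_ols` are hypothetical helper names. In practice, statistical software would be used.

```python
def solve(a, b):
    """Solve the square system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def fit_ols(rows, y):
    """Least-squares coefficients (intercept first) from the normal equations."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    xtx = [[sum(xi[j] * xi[k] for xi in X) for k in range(p)] for j in range(p)]
    xty = [sum(xi[j] * yi for xi, yi in zip(X, y)) for j in range(p)]
    return solve(xtx, xty)

# Made-up rows: (previous billing in $ millions, architects, engineers, staff)
firms = [
    (0.8, 2, 1, 9), (1.5, 4, 2, 22), (2.1, 5, 1, 24), (0.5, 1, 0, 6),
    (3.0, 8, 3, 41), (1.2, 3, 2, 13), (2.6, 6, 2, 33), (0.9, 3, 1, 12),
]
# Made-up responses built from a known equation so the fit can be checked.
current = [0.1 + 0.9 * b + 0.05 * a + 0.02 * e + 0.01 * s for b, a, e, s in firms]

coef = fit_ols(firms, current)
new_firm = (1.0, 3, 1, 17)  # the firm described in the exercise
prediction = coef[0] + sum(c * x for c, x in zip(coef[1:], new_firm))
print([round(c, 3) for c in coef], round(prediction, 3))
```

Because the made-up responses follow the generating equation exactly, the fitted coefficients reproduce it; real billing data would leave nonzero residuals to analyze.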
The following six exercises use the MOVIES data file. This data set
contains an SRS of 43 movies that are no longer playing in theaters.
This sample was collected from the Internet Movie Database (IMDb) to
see if information available soon after a movie’s theatrical release
can successfully predict total U.S. revenue.9
All dollar amounts are measured in millions of U.S. dollars.
11.37 Predicting movie revenue: Preliminary analysis. The response variable is a movie’s total U.S. revenue (USRevenue). Let’s consider as explanatory variables the movie’s budget (Budget); opening-weekend revenue (Opening); the number of theaters (Theaters) the movie was in for the opening weekend; and the movie’s IMDb rating (Ratings), which is on a 1 to 10 scale (10 being best). While this rating is updated continuously, we’ll assume that the current rating is the rating at the end of the first week.
Using numerical and graphical summaries, describe the distribution of each explanatory variable. Are there any unusual observations that should be monitored?
Using numerical and/or graphical summaries, describe the relationship between each pair of explanatory variables.
11.38 Predicting movie revenue: Simple linear regressions. Now let’s look at the response variable and its relationship with each explanatory variable.
Using numerical and graphical summaries, describe the distribution of the response variable USRevenue.
This variable is not Normally distributed. Does this violate one of the key model assumptions? Explain.
Generate scatterplots of each explanatory variable and USRevenue. Do all these relationships look linear? Explain what you see.
11.39 Predicting movie revenue: Multiple linear regression. Now consider fitting a model using all the explanatory variables.
(a) Write out the statistical model for this analysis, making sure to specify all assumptions.
(b) Run the multiple regression model and specify the fitted regression equation.
(c) Obtain the residuals from part (b) and check assumptions. Comment on any unusual residuals or patterns in the residuals.
(d) What percent of the variability in USRevenue is explained by this model?
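The percent of variability explained is the squared multiple correlation, R² = 1 − SSE/SST, and the regression standard error is s = √(SSE/(n − p − 1)). The sketch below computes both from observed and fitted values. The numbers are placeholders standing in for software output, not results from the MOVIES data, and `fit_summary` is a hypothetical helper name.

```python
import math

def fit_summary(y, fitted, p):
    """Return (R^2, s) for a least-squares fit with p explanatory variables."""
    n = len(y)
    y_bar = sum(y) / n
    sse = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # residual sum of squares
    sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
    r2 = 1 - sse / sst
    s = math.sqrt(sse / (n - p - 1))
    return r2, s

# Placeholder observed and fitted values (not from the MOVIES data file).
y = [10.0, 12.0, 15.0, 19.0, 24.0, 30.0]
fitted = [9.0, 13.0, 15.5, 19.5, 23.0, 30.0]

r2, s = fit_summary(y, fitted, p=2)
print(f"R^2 = {r2:.4f}, s = {s:.4f}")
```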
11.40 A simpler model. In the multiple regression analysis using all four explanatory variables, Theaters and Budget appear to be the least helpful (given that the other two explanatory variables are in the model).
Perform a new analysis using only the movie’s opening-weekend revenue and IMDb rating. Give the estimated regression equation for this analysis.
What percent of the variability in USRevenue is explained by this model?
Test the null hypothesis that Theaters and Budget combined add no additional predictive information beyond what is already contained in Opening and Ratings.
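A test that a subset of coefficients are all zero can be computed from the R² values of the full and reduced models. The sketch below uses the standard F statistic for comparing nested models, with n = 43 movies, p = 4 explanatory variables in the full model, and q = 2 variables dropped. The two R² values are placeholders, not results from the MOVIES data, and `partial_f` is a hypothetical helper name.

```python
def partial_f(r2_full, r2_reduced, n, p, q):
    """F statistic for H0: q of the p coefficients in the full model are all zero."""
    df_num = q
    df_den = n - p - 1
    f = (df_den / df_num) * (r2_full - r2_reduced) / (1.0 - r2_full)
    return f, df_num, df_den

# n = 43 movies, p = 4 predictors in the full model, q = 2 dropped (Theaters, Budget).
# The two R^2 values are placeholders, not results from the MOVIES data.
f_stat, df1, df2 = partial_f(r2_full=0.90, r2_reduced=0.89, n=43, p=4, q=2)
print(f"F = {f_stat:.2f} on ({df1}, {df2}) degrees of freedom")
```

The P-value is the upper-tail area of the F(q, n − p − 1) distribution beyond the computed statistic, which software or an F table would supply.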
11.41 Predicting U.S. movie revenue. The movie Kick-Ass was released during this same time period. It had a budget of $30.0 million and was shown in 3065 theaters, grossing $19.83 million during the first weekend.
Use software to construct a 95% prediction interval based on the model with all three explanatory variables.
Use software to construct a 95% prediction interval based on the model using only opening-weekend revenue and budget.
Compare the two intervals. Do the models give similar predictions and standard errors?
11.42 Considering the log transformation. Refer to Exercise 11.39. Variables like income often have very skewed distributions. This can result in certain cases strongly influencing the fit of the model. A common remedy is to take the log before analysis. Create a new response variable by taking the log of USRevenue and fit the model using all four predictors. Obtain the residuals and assess the model conditions. Compared to the untransformed data, do these data fit the linear regression model better? Explain your answer.
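To see why a log transformation can help with a strongly right-skewed variable, the following sketch compares a simple moment-based skewness measure before and after taking logs. The revenue values are made up for illustration (they are not from the MOVIES data file), and `skewness` is a hypothetical helper.

```python
import math

def skewness(values):
    """Moment-based sample skewness: third central moment over (second)^(3/2)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

# Made-up revenues in millions of dollars, with one very large value.
revenue = [2.1, 4.8, 7.5, 12.0, 18.3, 25.6, 40.2, 66.0, 110.5, 310.0]
log_revenue = [math.log(v) for v in revenue]

print(f"skewness of revenue:      {skewness(revenue):.2f}")
print(f"skewness of log(revenue): {skewness(log_revenue):.2f}")
```

For these values the skewness drops from strongly positive to near zero after the transformation, so the single large case has much less leverage on the fit.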
The following three exercises use the HAPPY data file. The World
Database of Happiness is an online registry of scientific research
on the subjective appreciation of life. It is available at
worlddatabaseofhappiness.eur.nl,
and the project is directed by Dr. Ruut Veenhoven, Erasmus
University, Rotterdam. One inventory presents the “average
happiness” score for various nations. This average is based on
individual responses from numerous general population surveys to a
general life satisfaction (well-being) question. Scores range from 0
(dissatisfied) to 10 (satisfied). The NationMaster website,
www.nationmaster.com,
contains a collection of statistics associated with various
nations. For our analysis, we will consider the GINI index, which
measures the degree of inequality in the distribution of income
(higher score = greater inequality), the degree of corruption in
government (higher score = less corruption), the average life
expectancy, and the degree of democracy (higher score = more civil
and political liberties).
11.43 Predicting a nation’s “average happiness” score. Consider the five statistics for each nation: LSI, the average life-satisfaction score; GINI, the GINI index; Corrupt, the degree of government corruption; Life, the average life expectancy; and Democracy, a measure of civil and political liberties.
Using numerical and graphical summaries, describe the distribution of each variable.
Using numerical and graphical summaries, describe the relationship between each pair of variables.
11.44 Building a multiple linear regression model. Let’s now build a model to predict the life-satisfaction score, LSI.
Consider a simple linear regression using GINI as the explanatory variable. Run the regression and summarize the results. Be sure to check assumptions.
Now consider a model using GINI and Life. Run the multiple regression and summarize the results. Again be sure to check assumptions.
Now consider a model using GINI, Life, and Democracy. Run the multiple regression and summarize the results. Again be sure to check assumptions.
Now consider a model using all four explanatory variables. Again summarize the results and check assumptions.
11.45 Selecting from among several models. Refer to the results from the previous exercise.
Make a table showing the estimated regression coefficients, standard errors, t statistics, and P-values.
Describe how the coefficients and P-values change for the four models.
Based on the table of coefficients, suggest another model. Run that model, summarize the results, and compare it with the other ones. Which model would you choose to explain LSI? Explain.
The following six exercises use the BIOMARK data file. Healthy
bones are continually being renewed by two processes. Through bone
formation, new bone is built; through bone resorption, old bone is
removed. If one or both of these processes are disturbed—by disease,
aging, or space travel, for example—bone loss can be the result. The
variables
11.46 Bone formation and resorption. Consider the
following four variables:
Using numerical and graphical summaries, describe the distribution of each of these variables.
Using numerical and graphical summaries, describe the relationship between each pair of variables.
11.47 Predicting bone formation. Let’s use regression methods to predict VO+, the measure of bone formation.
Because OC is a biomarker of bone formation, we start with a simple linear regression, using OC as the explanatory variable. Run the regression and summarize the results. Be sure to include an analysis of the residuals.
Because the processes of bone formation and bone resorption
are highly related, it is possible that there is some
information in the bone resorption variables that can tell us
something about bone formation. Use a model with both OC and
TRAP, the biomarker of bone resorption, to predict
11.48 More on predicting bone formation. Now
consider a regression model for predicting
Write out the statistical model for this analysis, including all assumptions.
Run the multiple regression to predict
Make a table giving the estimated regression coefficients, standard errors, and t statistics with P-values for this analysis and for the two that you ran in the previous exercise. Describe how the coefficients and the P-values differ for the three analyses.
Give the percent of variation in
The results you found in part (b) suggest another model. Run that model, summarize the results, and compare them with the results in part (b).
11.49 Predicting bone formation using transformed variables.
Because the distributions of
11.50 Predicting bone resorption. Refer to
Exercises 11.46, 11.47,
and 11.48.
Answer these questions with the roles of
11.51 Predicting bone resorption using transformed variables. Refer to the previous exercise. Rerun using logs.
The following 11 exercises use the PCB data file. Polychlorinated
biphenyls (PCBs) are a collection of synthetic compounds, called
congeners, that are particularly toxic to fetuses and young
children. Although PCBs are no longer produced in the United States,
they are still found in the environment. Because human exposure to
PCBs is primarily through the consumption of fish, the Environmental
Protection Agency (EPA) monitors the PCB levels in fish.
Unfortunately, there are 209 different congeners, and measuring all
of them in a fish specimen is an expensive and time-consuming
process. You’ve been asked to see if the total amount of PCBs in a
specimen can be estimated with only a few, easily quantifiable
congeners.11
If this can be done, costs can be greatly reduced.
11.52 Relationships among PCB congeners. Consider the following variables: PCB (the total amount of PCB) and four congeners: PCB52, PCB118, PCB138, and PCB180.
Using numerical and graphical summaries, describe the distribution of each of these variables.
Using numerical and graphical summaries, describe the relationship between each pair of variables.
11.53 Predicting the total amount of PCB. Use the four congeners PCB52, PCB118, PCB138, and PCB180 in a multiple regression to predict PCB.
(a) Write the statistical model for this analysis. Include all assumptions.
(b) Run the regression and summarize the results.
(c) Examine the residuals. Do they appear to be approximately Normal? When you plot them versus each of the explanatory variables, are any patterns evident?
11.54 Adjusting the analysis for potential outliers. The examination of the residuals in part (c) of the previous exercise suggests that there may be two outliers: one with a high residual and one with a low residual.
Because of safety issues, we are more concerned about underestimating PCB in a specimen than about overestimating. Give the specimen number for each of the two suspected outliers. Which one corresponds to an overestimate of PCB?
Rerun the analysis with the two suspected outliers deleted, summarize these results, and compare them with those you obtained in the previous exercise.
11.55 More on predicting the total amount of PCB. Run a regression to predict PCB using the variables PCB52, PCB118, and PCB138. Note that this is similar to the analysis that you did in Exercise 11.53, with the change that PCB180 is not included as an explanatory variable.
(a) Summarize the results.
(b) In this analysis, the regression coefficient for PCB118 is not statistically significant. Give the estimate of the coefficient and the associated P-value.
(c) Find the estimate of the coefficient for PCB118 and the associated P-value for the model analyzed in Exercise 11.53.
(d) Using the results in parts (b) and (c), write a short paragraph explaining how the inclusion of other variables in a multiple regression can have an effect on the estimate of a particular coefficient and the results of the associated significance test.
11.56 Multiple regression model for total TEQ. Dioxins and furans are other classes of chemicals that can cause undesirable health effects similar to those caused by PCB. The three types of chemicals are combined using toxic equivalent scores (TEQs), which attempt to measure the health effects on a common scale. The PCB data file contains TEQs for PCB, dioxins, and furans. The variables are called TEQPCB, TEQDioxin, and TEQFuran. The data file also includes the total TEQ, defined to be the sum of these three variables.
Consider using a multiple regression to predict TEQ using the three components TEQPCB, TEQDioxin, and TEQFuran as explanatory variables. Write the multiple regression model in the form
Give numerical values for the parameters
The multiple regression model assumes that the
Use software to run this regression and summarize the results.
11.57 Multiple regression model for total TEQ (continued).
The information summarized in TEQ is used to assess and manage
risks from these chemicals. For example, the World Health
Organization (WHO) has established the tolerable daily intake
(TDI) as one to four TEQs per kilogram of body weight per day.
Therefore, it would be very useful to have a procedure for
estimating TEQ using just a few variables that can be measured
cheaply. Use the four PCB congeners PCB52, PCB118, PCB138, and
PCB180 in a multiple regression to predict TEQ. Give a description
of the model and assumptions, summarize the results, examine the
residuals, and write a summary of what you have found.
11.58 Predicting total amount of PCB using transformed
variables.
Because distributions of variables such as PCB, the PCB congeners,
and TEQ tend to be skewed, researchers frequently analyze the
logarithms of the measured variables. Create a data set that has
the logs of each of the variables in the PCB data file. Note that
zero is a possible value for PCB126; most software packages will
eliminate these cases when you request a log transformation.
If you do not do anything about the 16 zero values of PCB126, what does your software do with these cases? Is there an error message of some kind?
If you attempt to run a regression to predict the log of PCB using the log of PCB126 and the log of PCB52, are the cases with the zero values of PCB126 eliminated? Do you think that this is a good way to handle this situation?
The smallest nonzero value of PCB126 is 0.0052. One common practice when taking logarithms of measured values is to replace the zeros by one-half of the smallest observed value. Create a logarithm data set using this procedure; that is, replace the 16 zero values of PCB126 by 0.0026 before taking logarithms. Use numerical and graphical summaries to describe the distributions of the log variables.
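The replacement rule described above can be sketched as follows. The values are illustrative only (apart from 0.0052, the smallest nonzero value quoted in the exercise), and `log_with_zero_fix` is a hypothetical helper name. Note that in Python, `math.log(0)` raises a `ValueError`, so zeros must be handled before the transformation.

```python
import math

def log_with_zero_fix(values):
    """Natural logs, with zeros replaced by half the smallest nonzero value."""
    smallest_nonzero = min(v for v in values if v > 0)
    fill = smallest_nonzero / 2
    return [math.log(v if v > 0 else fill) for v in values]

# Illustrative PCB126 values, not the PCB data file; 0.0052 is the smallest nonzero value.
pcb126 = [0.0, 0.0052, 0.031, 0.0, 0.012, 0.27]
logged = log_with_zero_fix(pcb126)
print([round(v, 3) for v in logged])
```

Every case is retained, and the former zeros all map to log(0.0026), the smallest value in the transformed variable.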
11.59 Predicting total amount of PCB using transformed variables
(continued).
Refer to the previous exercise.
Use numerical and graphical summaries to describe the relationship between each pair of log variables.
Compare these summaries with the summaries that you produced in Exercise 11.52 for the measured variables.
11.60 Even more on predicting total amount of PCB using
transformed variables.
Use the log data set that you created in
Exercise 11.58
to find a good multiple regression model for predicting the log of
PCB. Use only log PCB variables for this analysis. Write a report
summarizing your results.
11.61 Predicting total TEQ using transformed variables.
Use the log data set that you created in
Exercise 11.58
to find a good multiple regression model for predicting the log of
TEQ. Use only log PCB variables for this analysis. Write a report
summarizing your results and comparing them with the results that
you obtained in the previous exercise.
11.62 Interpretation of coefficients in log PCB regressions. Use the results of your analysis of the log PCB data in Exercise 11.60 to write an explanation of how regression coefficients, standard errors of regression coefficients, and tests of significance for explanatory variables can change depending on what other explanatory variables are included in the multiple regression analysis.
The following nine exercises use the CHEESE data file. As cheddar
cheese matures, a variety of chemical processes take place. The
taste of matured cheese is related to the concentration of several
chemicals in the final product. In a study of cheddar cheese from
the LaTrobe Valley of Victoria, Australia, samples of cheese were
analyzed for their chemical composition and were subjected to taste
tests. The variable Case is used to number the observations from 1
to 30. Taste is the response variable of interest. The taste scores
were obtained by combining the scores from several tasters. Three of
the chemicals whose concentrations were measured were acetic acid,
hydrogen sulfide, and lactic acid. For acetic acid and hydrogen
sulfide, (natural) log transformations were taken. Thus, the
explanatory variables are the transformed concentrations of acetic
acid (Acetic) and hydrogen sulfide (H2S) and the untransformed
concentration of lactic acid (Lactic).12
11.63 Describing the explanatory variables. For each of the four variables in the CHEESE data file, find the mean, median, standard deviation, and interquartile range. Display each distribution by means of a stemplot and use a Normal quantile plot to assess Normality of the data. Summarize your findings. Note that when doing regressions with these data, we do not assume that these distributions are Normal. Only the residuals from our model need to be (approximately) Normal. The careful study of each variable to be analyzed is, nonetheless, an important first step in any statistical analysis.
11.64 Pairwise scatterplots of the explanatory variables. Make a scatterplot for each pair of variables in the CHEESE data file (you will have six plots). Describe the relationships. Calculate the correlation for each pair of variables and report the P-value for the test of zero population correlation in each case.
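The test of zero population correlation uses the statistic t = r√(n − 2)/√(1 − r²) with n − 2 degrees of freedom. A minimal sketch, assuming n = 30 cheese samples and a placeholder correlation r = 0.5 (not a value from the CHEESE data); `correlation_t` is a hypothetical helper name.

```python
import math

def correlation_t(r, n):
    """t statistic (df = n - 2) for H0: population correlation is zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r ** 2)

# r is a placeholder; n = 30 is the number of cheese samples.
t = correlation_t(r=0.5, n=30)
print(f"t = {t:.3f} with {30 - 2} degrees of freedom")
```

The two-sided P-value is then found from the t(28) distribution using software or a t table.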
11.65 Simple linear regression model of Taste. Perform a simple linear regression analysis using Taste as the response variable and Acetic as the explanatory variable. Be sure to examine the residuals carefully. Summarize your results. Include a plot of the data with the least-squares regression line. Plot the residuals versus each of the other two chemicals. Are any patterns evident? (The concentrations of the other chemicals are lurking variables for the simple linear regression.)
11.66 Another simple linear regression model of Taste. Repeat the analysis of Exercise 11.65 using Taste as the response variable and H2S as the explanatory variable.
11.67 The final simple linear regression model of Taste. Repeat the analysis of Exercise 11.65 using Taste as the response variable and Lactic as the explanatory variable.
11.68 Comparing the simple linear regression models.
Compare the results of the regressions performed in the three
previous exercises. Construct a table with values of the
F statistic, its P-value,
11.69 Multiple regression model of Taste. Carry out a multiple regression using Acetic and H2S to predict Taste. Summarize the results of your analysis. Compare the statistical significance of Acetic in this model with its significance in the model with Acetic alone as a predictor (Exercise 11.65). Which model do you prefer? Give a simple explanation for the fact that Acetic alone appears to be a good predictor of Taste, but with H2S in the model, it is not.
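With two explanatory variables, the standardized regression coefficients can be written directly in terms of the three pairwise correlations, which makes it easy to see how a correlated second predictor can absorb most of the first predictor's apparent effect. The sketch below uses placeholder correlations, not values computed from the CHEESE data, and `standardized_slopes` is a hypothetical helper name.

```python
def standardized_slopes(r_y1, r_y2, r_12):
    """Standardized coefficients for a regression of y on two predictors.

    r_y1, r_y2: correlations of y with predictors 1 and 2.
    r_12: correlation between the two predictors.
    """
    denom = 1 - r_12 ** 2
    b1 = (r_y1 - r_y2 * r_12) / denom
    b2 = (r_y2 - r_y1 * r_12) / denom
    return b1, b2

# Placeholder correlations: Taste-Acetic, Taste-H2S, Acetic-H2S.
b_acetic, b_h2s = standardized_slopes(r_y1=0.55, r_y2=0.75, r_12=0.60)
print(f"Acetic alone: {0.55:.3f}; Acetic with H2S in the model: {b_acetic:.3f}")
print(f"H2S with Acetic in the model: {b_h2s:.3f}")
```

In a simple linear regression the standardized slope equals the correlation, so with these placeholder values the Acetic coefficient shrinks from 0.55 to about 0.16 once H2S is included.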
11.70 Another multiple regression model of Taste. Carry out a multiple regression using H2S and Lactic to predict Taste. When we compare the results of this analysis with the simple linear regressions using each of these explanatory variables alone, it is evident that a better result is obtained by using both predictors in a model. Support this statement with explicit information obtained from your analysis.
11.71 The final multiple regression model of Taste. Use the three explanatory variables Acetic, H2S, and Lactic in a multiple regression to predict Taste. Write a short summary of your results, including an examination of the residuals. Based on all the regression analyses you have carried out on these data, which model do you prefer and why?
11.72 Is there a difference? In Exercises 10.31 and 10.32 (page 556), we studied the relationship between room temperature and academic performance for each of the sexes. Use what was learned in Exercise 11.18 (page 575) to compare these two regression lines using one multiple regression model. Summarize your results in a brief paragraph.
11.73 Brain injuries in Canadian football players.
Multiple concussions have been shown to be associated with
neurodegenerative diseases. In one study, the brain volumes of
the left hippocampus of 53 retired Canadian football players, 25
age- and education-matched controls, and controls from the
Centre for Aging and Neuroscience database were compared.13
Using two indicator variables
The following table summarizes the results:
| Explanatory variable | b | P-value |
|---|---|---|
| Intercept | | 0.350 |
| | 0.412 | 0.047 |
| | 0.119 | 0.347 |
| Age | | <0.001 |
| | 0.169 | 0.528 |
| | | 0.017 |
Extending what you’ve learned in Exercise 11.18 (page 575), report the three linear regressions of volume versus age.
Using the t tests for the individual coefficients, summarize what this model tells you about the left hippocampus volume and age across these three groups.
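A model with group indicators and indicator-by-Age interactions implies one regression line per group. The sketch below shows the bookkeeping, assuming the model has the form volume = b0 + b1·I1 + b2·I2 + b3·Age + b4·(I1·Age) + b5·(I2·Age), which is an assumption based on the six rows of the table. The coefficient values and the names `group_lines`, `I1`, and `I2` are hypothetical, not taken from the study.

```python
def group_lines(b0, b_g1, b_g2, b_age, b_g1_age, b_g2_age):
    """Return (intercept, slope) of volume versus age for each of three groups."""
    return {
        "reference": (b0, b_age),                     # both indicators equal 0
        "group 1": (b0 + b_g1, b_age + b_g1_age),     # I1 = 1, I2 = 0
        "group 2": (b0 + b_g2, b_age + b_g2_age),     # I1 = 0, I2 = 1
    }

# Placeholder coefficients, not the values from the study.
lines = group_lines(b0=5.0, b_g1=0.4, b_g2=0.1,
                    b_age=-0.02, b_g1_age=0.01, b_g2_age=-0.005)
for group, (intercept, slope) in lines.items():
    print(f"{group}: volume = {intercept:.3f} + ({slope:.3f}) * age")
```

Under this coding, the t test for an indicator asks whether that group's intercept differs from the reference group's, and the t test for an interaction asks whether its slope differs.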
11.74 CEO pay and gross profits. In Exercise 10.50 (page 559), you assessed the relationship between the logarithm of a CEO’s pay ratio and the logarithm of the company’s gross profit per employee. These companies, however, are divided into different industries. Is the relationship the same for each of these industries? The data set CNBC1 contains the centered values for log(Ratio) and log(Profit) plus three dummy variables and their interactions with the centered log(Ratio). Use them to compare the linear relationships and write a short paragraph summarizing your findings.