Chapter 11 EXERCISES

11.35 Checking for a polynomial relationship. When looking at the residuals from the simple linear model of BMI versus physical activity (PA), Figure 10.5 (page 524) suggested a possible curvilinear relationship. Let’s investigate fitting a quadratic (q=2) polynomial (see Exercise 11.15, page 574) for the physical activity problem.
1. It is often best to subtract the sample mean x̄ before creating the necessary explanatory variables. In this case, the average number of steps per day is 8614, or 8.614 when PA is recorded in thousands of steps. Create new explanatory variables x1 = (PA − 8.614) and x2 = (PA − 8.614)² and run a multiple regression for BMI using the explanatory variables x1 and x2. Write down the fitted regression equation.
2. The regression model that included only PA had R² = 14.9%. What is R² with the inclusion of this quadratic term?
3. Obtain the residuals from part 1 and check the multiple regression assumptions. Are there any remaining patterns in the data? Are the residuals approximately Normal? Explain.
4. Test the hypothesis that the coefficient of the variable x2 is equal to 0. Report the t statistic, degrees of freedom, and P-value. Does the quadratic term contribute significantly to the fit? Explain your answer.
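
The advice in part 1 about subtracting the mean can be illustrated with a small sketch. It uses made-up PA values (in thousands of steps), not the exercise data, and the helper name `corr` is hypothetical. The point is only that PA and PA² are nearly collinear, while the centered terms x1 and x2 are much less correlated, which tends to make the two coefficients easier to estimate and interpret.

```python
def corr(u, v):
    """Pearson correlation computed from deviations about the means."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5

# Hypothetical PA values in thousands of steps per day (NOT the exercise data).
pa = [4.0, 6.0, 8.0, 10.0, 12.0]
pa_mean = sum(pa) / len(pa)             # plays the role of 8.614 in the exercise

x1 = [v - pa_mean for v in pa]          # centered linear term
x2 = [v ** 2 for v in x1]               # centered quadratic term

r_raw = corr(pa, [v ** 2 for v in pa])  # correlation of PA with PA squared
r_centered = corr(x1, x2)               # correlation of x1 with x2
print(round(r_raw, 3), round(r_centered, 3))
```

With these symmetric made-up values the centered correlation is essentially zero; with real data it will usually be reduced rather than eliminated.
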
11.36 Architectural firm billings. A summary of firms engaged in commercial architecture in the Indianapolis, Indiana, area provides firm characteristics, including total annual billing in the current year, total annual billing in the previous year, the number of architects, the number of engineers, and the number of staff employed in the firm.⁸ Consider developing a model to predict current total billing using the other four variables.
1. Using numerical and graphical summaries, describe the distribution of current and past year total billing and the number of architects, engineers, and staff.
2. For each of the 10 pairs of variables, use graphical and numerical summaries to describe the relationship.
3. Carry out a multiple regression. Report the fitted regression equation and the value of the regression standard error s.
4. Analyze the residuals from the multiple regression. Are there any concerns?
5. A firm did not report its current total billing but had $1 million in billing last year and employs 3 architects, 1 engineer, and 17 staff members. What is the predicted total billing for this firm?
6. This analysis utilized the data from all commercial firms in the Indianapolis area that responded to the survey. Provide justification for the use of inference in this setting.

The following six exercises use the MOVIES data file. This data set contains an SRS of 43 movies that are no longer playing in theaters. This sample was collected from the Internet Movie Database (IMDb) to see if information available soon after a movie’s theatrical release can successfully predict total U.S. revenue.⁹ All dollar amounts are measured in millions of U.S. dollars.

11.37 Predicting movie revenue: Preliminary analysis. The response variable is a movie’s total U.S. revenue (USRevenue). Let’s consider as explanatory variables the movie’s budget (Budget); opening-weekend revenue (Opening); the number of theaters (Theaters) the movie was in for the opening weekend; and the movie’s IMDb rating (Ratings), which is on a 1 to 10 scale (10 being best). While this rating is updated continuously, we’ll assume that the current rating is the rating at the end of the first week.
1. Using numerical and graphical summaries, describe the distribution of each explanatory variable. Are there any unusual observations that should be monitored?
2. Using numerical and/or graphical summaries, describe the relationship between each pair of explanatory variables.
11.38 Predicting movie revenue: Simple linear regressions. Now let’s look at the response variable and its relationship with each explanatory variable.
1. Using numerical and graphical summaries, describe the distribution of the response variable USRevenue.
2. This variable is not Normally distributed. Does this violate one of the key model assumptions? Explain.
3. Generate scatterplots of each explanatory variable and USRevenue. Do all these relationships look linear? Explain what you see.
11.39 Predicting movie revenue: Multiple linear regression. Now consider fitting a model using all the explanatory variables.
1. Write out the statistical model for this analysis, making sure to specify all assumptions.
2. Run the multiple regression model and specify the fitted regression equation.
3. Obtain the residuals from part 2 and check assumptions. Comment on any unusual residuals or patterns in the residuals.
4. What percent of the variability in USRevenue is explained by this model?
11.40 A simpler model. In the multiple regression analysis using all four explanatory variables, Theaters and Budget appear to be the least helpful (given that the other two explanatory variables are in the model).
1. Perform a new analysis using only the movie’s opening-weekend revenue and IMDb rating. Give the estimated regression equation for this analysis.
2. What percent of the variability in USRevenue is explained by this model?
3. Test the null hypothesis that Theaters and Budget combined add no additional predictive information beyond what is already contained in Opening and Ratings.
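
One common way to carry out the test in part 3 is the F statistic that compares the R² of the full model with the R² of the reduced model. The sketch below assumes that approach; the function name `partial_f` is hypothetical and the two R² values are placeholders, not results from the MOVIES data. Only n = 43, the four predictors in the full model, and the two dropped predictors come from the exercises.

```python
def partial_f(r2_full, r2_reduced, n, p, q):
    """F statistic for testing that q of the p coefficients in the full model are all zero.

    Returns the statistic and its (numerator, denominator) degrees of freedom.
    """
    numerator = (r2_full - r2_reduced) / q
    denominator = (1.0 - r2_full) / (n - p - 1)
    return numerator / denominator, (q, n - p - 1)

# n = 43 movies, p = 4 predictors in the full model, q = 2 dropped (Theaters, Budget).
# The two R-squared values are placeholders, NOT results from the MOVIES data.
f_stat, df = partial_f(r2_full=0.90, r2_reduced=0.88, n=43, p=4, q=2)
print(f_stat, df)
```

The P-value is the upper-tail area of the F distribution with these degrees of freedom, which statistical software or an F table can supply.
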
11.41 Predicting U.S. movie revenue. The movie Kick-Ass was released during this same time period. It had a budget of $30.0 million and was shown in 3065 theaters, grossing $19.83 million during the first weekend.
1. Use software to construct a 95% prediction interval based on the model with all three explanatory variables.
2. Use software to construct a 95% prediction interval based on the model using only opening-weekend revenue and budget.
3. Compare the two intervals. Do the models give similar predictions and standard errors?
11.42 Considering the log transformation. Refer to Exercise 11.39. Variables like income often have very skewed distributions. This can result in certain cases strongly influencing the fit of the model. A common remedy is to take the log before analysis. Create a new response variable by taking the log of USRevenue and fit the model using all four predictors. Obtain the residuals and assess the model conditions. Compared to the untransformed data, do these data fit the linear regression model better? Explain your answer.

The following three exercises use the HAPPY data file. The World Database of Happiness is an online registry of scientific research on the subjective appreciation of life. It is available at worlddatabaseofhappiness.eur.nl, and the project is directed by Dr. Ruut Veenhoven, Erasmus University, Rotterdam. One inventory presents the “average happiness” score for various nations. This average is based on individual responses from numerous general population surveys to a general life satisfaction (well-being) question. Scores range from 0 (dissatisfied) to 10 (satisfied). The NationMaster website, www.nationmaster.com, contains a collection of statistics associated with various nations. For our analysis, we will consider the GINI index, which measures the degree of inequality in the distribution of income (higher score = greater inequality), the degree of corruption in government (higher score = less corruption), the average life expectancy, and the degree of democracy (higher score = more civil and political liberties).

11.43 Predicting a nation’s “average happiness” score. Consider the five statistics for each nation: LSI, the average life-satisfaction score; GINI, the GINI index; Corrupt, the degree of government corruption; Life, the average life expectancy; and Democracy, a measure of civil and political liberties.
1. Using numerical and graphical summaries, describe the distribution of each variable.
2. Using numerical and graphical summaries, describe the relationship between each pair of variables.
11.44 Building a multiple linear regression model. Let’s now build a model to predict the life-satisfaction score, LSI.
1. Consider a simple linear regression using GINI as the explanatory variable. Run the regression and summarize the results. Be sure to check assumptions.
2. Now consider a model using GINI and Life. Run the multiple regression and summarize the results. Again be sure to check assumptions.
3. Now consider a model using GINI, Life, and Democracy. Run the multiple regression and summarize the results. Again be sure to check assumptions.
4. Now consider a model using all four explanatory variables. Again summarize the results and check assumptions.
11.45 Selecting from among several models. Refer to the results from the previous exercise.
1. Make a table showing the estimated regression coefficients, standard errors, t statistics, and P-values.
2. Describe how the coefficients and P-values change for the four models.
3. Based on the table of coefficients, suggest another model. Run that model, summarize the results, and compare it with the other ones. Which model would you choose to explain LSI? Explain.

The following six exercises use the BIOMARK data file. Healthy bones are continually being renewed by two processes. Through bone formation, new bone is built; through bone resorption, old bone is removed. If one or both of these processes are disturbed—by disease, aging, or space travel, for example—bone loss can be the result. The variables VO+ and VO− measure bone formation and bone resorption, respectively. Osteocalcin (OC) is a biochemical marker for bone formation: higher levels of bone formation are associated with higher levels of OC. A blood sample is used to measure OC, and it is much less expensive to obtain than direct measures of bone formation. The units are milligrams of OC per milliliter of blood (mg/ml). Similarly, tartrate-resistant acid phosphatase (TRAP) is a biochemical marker for bone resorption that is also measured in blood. It is measured in units per liter (U/l). These variables were measured in a study of 31 healthy women aged 11 to 32 years.¹⁰ Variables with the first letter “L” are the logarithms of the measured variables.

11.46 Bone formation and resorption. Consider the following four variables: VO+, a measure of bone formation; VO−, a measure of bone resorption; OC, a biomarker of bone formation; and TRAP, a biomarker of bone resorption.
1. Using numerical and graphical summaries, describe the distribution of each of these variables.
2. Using numerical and graphical summaries, describe the relationship between each pair of variables.
11.47 Predicting bone formation. Let’s use regression methods to predict VO+, the measure of bone formation.
1. Because OC is a biomarker of bone formation, we start with a simple linear regression, using OC as the explanatory variable. Run the regression and summarize the results. Be sure to include an analysis of the residuals.
2. Because the processes of bone formation and bone resorption are highly related, it is possible that there is some information in the bone resorption variables that can tell us something about bone formation. Use a model with both OC and TRAP, the biomarker of bone resorption, to predict VO+. Summarize the results. In the context of this model, it appears that TRAP is a better predictor of bone formation, VO+, than the biomarker of bone formation, OC. Is this view consistent with the pattern of relationships that you described in the previous exercise? One possible explanation is that, although all these variables are highly related, TRAP is measured with more precision than OC.
11.48 More on predicting bone formation. Now consider a regression model for predicting VO+ using OC, TRAP, and VO−.
1. Write out the statistical model for this analysis, including all assumptions.
2. Run the multiple regression to predict VO+ using OC, TRAP, and VO−. Summarize the results.
3. Make a table giving the estimated regression coefficients, standard errors, and t statistics with P-values for this analysis and for the two that you ran in the previous exercise. Describe how the coefficients and the P-values differ for the three analyses.
4. Give the percent of variation in VO+ explained by each of the three models and the estimate of σ. Give a short summary.
5. The results you found in part 2 suggest another model. Run that model, summarize the results, and compare them with the results in part 2.
11.49 Predicting bone formation using transformed variables. Because the distributions of VO+, VO−, OC, and TRAP tend to be skewed, it is common to work with logarithms rather than the measured values. Using the questions in the previous three exercises as a guide, analyze the log data.
11.50 Predicting bone resorption. Refer to Exercises 11.46, 11.47, and 11.48. Answer these questions with the roles of VO+ and VO− reversed; that is, run models to predict VO−, with VO+ as an explanatory variable.
11.51 Predicting bone resorption using transformed variables. Refer to the previous exercise. Rerun using logs.

The following 11 exercises use the PCB data file. Polychlorinated biphenyls (PCBs) are a collection of synthetic compounds, called congeners, that are particularly toxic to fetuses and young children. Although PCBs are no longer produced in the United States, they are still found in the environment. Because human exposure to PCBs is primarily through the consumption of fish, the Environmental Protection Agency (EPA) monitors the PCB levels in fish. Unfortunately, there are 209 different congeners, and measuring all of them in a fish specimen is an expensive and time-consuming process. You’ve been asked to see if the total amount of PCBs in a specimen can be estimated with only a few, easily quantifiable congeners.¹¹ If this can be done, costs can be greatly reduced.

11.52 Relationships among PCB congeners. Consider the following variables: PCB (the total amount of PCB) and four congeners: PCB52, PCB118, PCB138, and PCB180.
1. Using numerical and graphical summaries, describe the distribution of each of these variables.
2. Using numerical and graphical summaries, describe the relationship between each pair of variables.
11.53 Predicting the total amount of PCB. Use the four congeners PCB52, PCB118, PCB138, and PCB180 in a multiple regression to predict PCB.
1. Write the statistical model for this analysis. Include all assumptions.
2. Run the regression and summarize the results.
3. Examine the residuals. Do they appear to be approximately Normal? When you plot them versus each of the explanatory variables, are any patterns evident?
11.54 Adjusting the analysis for potential outliers. The examination of the residuals in part 3 of the previous exercise suggests that there may be two outliers: one with a high residual and one with a low residual.
1. Because of safety issues, we are more concerned about underestimating PCB in a specimen than about overestimating. Give the specimen number for each of the two suspected outliers. Which one corresponds to an overestimate of PCB?
2. Rerun the analysis with the two suspected outliers deleted, summarize these results, and compare them with those you obtained in the previous exercise.
11.55 More on predicting the total amount of PCB. Run a regression to predict PCB using the variables PCB52, PCB118, and PCB138. Note that this is similar to the analysis that you did in Exercise 11.53, with the change that PCB180 is not included as an explanatory variable.
1. Summarize the results.
2. In this analysis, the regression coefficient for PCB118 is not statistically significant. Give the estimate of the coefficient and the associated P-value.
3. Find the estimate of the coefficient for PCB118 and the associated P-value for the model analyzed in Exercise 11.53.
4. Using the results in parts 2 and 3, write a short paragraph explaining how the inclusion of other variables in a multiple regression can have an effect on the estimate of a particular coefficient and the results of the associated significance test.
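
The effect described in the last part of this exercise can be seen in a small sketch with made-up, noise-free data (not the PCB file). The helper names `solve` and `ols` are hypothetical, and the fit uses the normal equations purely for illustration; statistical software would normally do this step. Because x2 is correlated with x1, the coefficient of x1 changes when x2 enters the model. The sketch shows only the estimates, not the significance tests.

```python
def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    # Work on an augmented copy so the inputs are left untouched.
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x

def ols(columns, y):
    """Least-squares coefficients [b0, b1, ...] from the normal equations."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return solve(xtx, xty)

# Made-up data: y depends on both predictors, and x2 is correlated with x1.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0]
y = [1.0 + 2.0 * a + 3.0 * b for a, b in zip(x1, x2)]

alone = ols([x1], y)       # y on x1 only
both = ols([x1, x2], y)    # y on x1 and x2
print([round(c, 3) for c in alone], [round(c, 3) for c in both])
```

In the one-predictor fit, the slope for x1 absorbs part of the effect of the omitted x2; once x2 is included, the slope for x1 drops back to the value used to generate the data.
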
11.56 Multiple regression model for total TEQ. Dioxins and furans are other classes of chemicals that can cause undesirable health effects similar to those caused by PCB. The three types of chemicals are combined using toxic equivalent scores (TEQs), which attempt to measure the health effects on a common scale. The PCB data file contains TEQs for PCB, dioxins, and furans. The variables are called TEQPCB, TEQDioxin, and TEQFuran. The data file also includes the total TEQ, defined to be the sum of these three variables.
1. Consider using a multiple regression to predict TEQ using the three components TEQPCB, TEQDioxin, and TEQFuran as explanatory variables. Write the multiple regression model in the form
  
  TEQ = β₀ + β₁·TEQPCB + β₂·TEQDioxin + β₃·TEQFuran + ϵ
  
  Give numerical values for the parameters β₀, β₁, β₂, and β₃.
2. The multiple regression model assumes that the ϵ’s are Normal with mean zero and standard deviation σ. What is the numerical value of σ?
3. Use software to run this regression and summarize the results.
11.57 Multiple regression model for total TEQ (continued). The information summarized in TEQ is used to assess and manage risks from these chemicals. For example, the World Health Organization (WHO) has established the tolerable daily intake (TDI) as one to four TEQs per kilogram of body weight per day. Therefore, it would be very useful to have a procedure for estimating TEQ using just a few variables that can be measured cheaply. Use the four PCB congeners PCB52, PCB118, PCB138, and PCB180 in a multiple regression to predict TEQ. Give a description of the model and assumptions, summarize the results, examine the residuals, and write a summary of what you have found.
11.58 Predicting total amount of PCB using transformed variables. Because distributions of variables such as PCB, the PCB congeners, and TEQ tend to be skewed, researchers frequently analyze the logarithms of the measured variables. Create a data set that has the logs of each of the variables in the PCB data file. Note that zero is a possible value for PCB126; most software packages will eliminate these cases when you request a log transformation.
1. If you do not do anything about the 16 zero values of PCB126, what does your software do with these cases? Is there an error message of some kind?
2. If you attempt to run a regression to predict the log of PCB using the log of PCB126 and the log of PCB52, are the cases with the zero values of PCB126 eliminated? Do you think that this is a good way to handle this situation?
3. The smallest nonzero value of PCB126 is 0.0052. One common practice when taking logarithms of measured values is to replace the zeros by one-half of the smallest observed value. Create a logarithm data set using this procedure; that is, replace the 16 zero values of PCB126 by 0.0026 before taking logarithms. Use numerical and graphical summaries to describe the distributions of the log variables.
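
The replacement rule in part 3 can be sketched as follows. The list of readings is hypothetical; only the smallest nonzero value, 0.0052, comes from the exercise. The sketch also shows what plain Python does with the log of zero, which may differ from the behavior of your statistical software.

```python
import math

# Hypothetical PCB126 readings; only the smallest nonzero value (0.0052)
# is taken from the exercise.
pcb126 = [0.0, 0.0052, 0.031, 0.0, 0.12]

# The log of zero is undefined; Python's math.log raises ValueError
# instead of silently dropping the case.
try:
    math.log(0.0)
    zero_log_fails = False
except ValueError:
    zero_log_fails = True

smallest_nonzero = min(v for v in pcb126 if v > 0)
replacement = smallest_nonzero / 2      # one-half of the smallest observed value
adjusted = [v if v > 0 else replacement for v in pcb126]
log_pcb126 = [math.log(v) for v in adjusted]
print(replacement, [round(v, 3) for v in log_pcb126])
```

Replacing zeros this way keeps all cases in the analysis, at the cost of an arbitrary choice for the substituted value.
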
11.59 Predicting total amount of PCB using transformed variables (continued). Refer to the previous exercise.
1. Use numerical and graphical summaries to describe the relationship between each pair of log variables.
2. Compare these summaries with the summaries that you produced in Exercise 11.52 for the measured variables.
11.60 Even more on predicting total amount of PCB using transformed variables. Use the log data set that you created in Exercise 11.58 to find a good multiple regression model for predicting the log of PCB. Use only log PCB variables for this analysis. Write a report summarizing your results.
11.61 Predicting total TEQ using transformed variables. Use the log data set that you created in Exercise 11.58 to find a good multiple regression model for predicting the log of TEQ. Use only log PCB variables for this analysis. Write a report summarizing your results and comparing them with the results that you obtained in the previous exercise.
11.62 Interpretation of coefficients in log PCB regressions. Use the results of your analysis of the log PCB data in Exercise 11.60 to write an explanation of how regression coefficients, standard errors of regression coefficients, and tests of significance for explanatory variables can change depending on what other explanatory variables are included in the multiple regression analysis.

The following nine exercises use the CHEESE data file. As cheddar cheese matures, a variety of chemical processes take place. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests. The variable Case is used to number the observations from 1 to 30. Taste is the response variable of interest. The taste scores were obtained by combining the scores from several tasters. Three of the chemicals whose concentrations were measured were acetic acid, hydrogen sulfide, and lactic acid. For acetic acid and hydrogen sulfide, (natural) log transformations were taken. Thus, the explanatory variables are the transformed concentrations of acetic acid (Acetic) and hydrogen sulfide (H2S) and the untransformed concentration of lactic acid (Lactic).¹²

11.63 Describing the explanatory variables. For each of the four variables in the CHEESE data file, find the mean, median, standard deviation, and interquartile range. Display each distribution by means of a stemplot and use a Normal quantile plot to assess Normality of the data. Summarize your findings. Note that when doing regressions with these data, we do not assume that these distributions are Normal. Only the residuals from our model need to be (approximately) Normal. The careful study of each variable to be analyzed is, nonetheless, an important first step in any statistical analysis.
11.64 Pairwise scatterplots of the explanatory variables. Make a scatterplot for each pair of variables in the CHEESE data file (you will have six plots). Describe the relationships. Calculate the correlation for each pair of variables and report the P-value for the test of zero population correlation in each case.
11.65 Simple linear regression model of Taste. Perform a simple linear regression analysis using Taste as the response variable and Acetic as the explanatory variable. Be sure to examine the residuals carefully. Summarize your results. Include a plot of the data with the least-squares regression line. Plot the residuals versus each of the other two chemicals. Are any patterns evident? (The concentrations of the other chemicals are lurking variables for the simple linear regression.)
11.66 Another simple linear regression model of Taste. Repeat the analysis of Exercise 11.65 using Taste as the response variable and H2S as the explanatory variable.
11.67 The final simple linear regression model of Taste. Repeat the analysis of Exercise 11.65 using Taste as the response variable and Lactic as the explanatory variable.
11.68 Comparing the simple linear regression models. Compare the results of the regressions performed in the three previous exercises. Construct a table with values of the F statistic, its P-value, R², and the estimate s of the standard deviation for each model. Report the three regression equations. Why are the intercepts in these three equations different?
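
The quantities requested for this table all come from the same sums of squares. The sketch below computes them for a single simple linear regression, using made-up data rather than the CHEESE file; the function name `simple_regression` is hypothetical.

```python
def simple_regression(x, y):
    """Least-squares fit of y on x with the summary statistics used to compare models."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    sxx = sum((xi - xbar) ** 2 for xi in x)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    b1 = sxy / sxx
    b0 = ybar - b1 * xbar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - ybar) ** 2 for yi in y)
    mse = sse / (n - 2)                  # error mean square, n - 2 degrees of freedom
    return {
        "b0": b0,
        "b1": b1,
        "R2": 1 - sse / sst,
        "s": mse ** 0.5,                 # estimate of sigma
        "F": (sst - sse) / mse,          # model mean square (1 df) over error mean square
    }

# Made-up data for illustration only; not from the CHEESE file.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [3.1, 4.9, 7.2, 8.8, 11.0]
fit = simple_regression(x, y)
print({k: round(v, 3) for k, v in fit.items()})
```

The P-value for F requires the F distribution with 1 and n − 2 degrees of freedom, which is left to software or a table here.
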
11.69 Multiple regression model of Taste. Carry out a multiple regression using Acetic and H2S to predict Taste. Summarize the results of your analysis. Compare the statistical significance of Acetic in this model with its significance in the model with Acetic alone as a predictor (Exercise 11.65). Which model do you prefer? Give a simple explanation for the fact that Acetic alone appears to be a good predictor of Taste, but with H2S in the model, it is not.
11.70 Another multiple regression model of Taste. Carry out a multiple regression using H2S and Lactic to predict Taste. When we compare the results of this analysis with the simple linear regressions using each of these explanatory variables alone, it is evident that a better result is obtained by using both predictors in a model. Support this statement with explicit information obtained from your analysis.
11.71 The final multiple regression model of Taste. Use the three explanatory variables Acetic, H2S, and Lactic in a multiple regression to predict Taste. Write a short summary of your results, including an examination of the residuals. Based on all the regression analyses you have carried out on these data, which model do you prefer and why?

PUTTING IT ALL TOGETHER

11.72 Is there a difference? In Exercises 10.31 and 10.32 (page 556), we studied the relationship between room temperature and academic performance for each of the sexes. Use what was learned in Exercise 11.18 (page 575) to compare these two regression lines using one multiple regression model. Summarize your results in a brief paragraph.

11.73 Brain injuries in Canadian football players. Multiple concussions have been shown to be associated with neurodegenerative diseases. In one study, the brain volumes of the left hippocampus of 53 retired Canadian football players, 25 age- and education-matched controls, and controls from the Centre for Aging and Neuroscience database were compared.¹³ Using two indicator variables (C1=1 if matched control and C2=1 if retired football player), the researchers compared three simple linear regressions by fitting the model

volume = β₀ + β₁·C1 + β₂·C2 + β₃·Age + β₄·(C1 × Age) + β₅·(C2 × Age) + ϵ

The following table summarizes the results:

Explanatory variable	b	P-value
Intercept	−0.046	0.350
C1	0.412	0.047
C2	0.119	0.347
Age	−0.484	<0.001
C1Age	0.169	0.528
C2Age	−0.351	0.017

1. Extending what you’ve learned in Exercise 11.18 (page 575), report the three linear regressions of volume versus age.
2. Using the t tests for the individual coefficients, summarize what this model tells you about the left hippocampus volume and age across these three groups.
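
A model with indicator variables and interactions gives one line per group: setting C1 and C2 to their values for a group and collecting terms yields that group's intercept and slope. The sketch below applies this arithmetic to the estimates in the table. It assumes that the group with C1 = 0 and C2 = 0 is the set of database controls, and the function name `group_line` is hypothetical.

```python
# Estimated coefficients copied from the table in Exercise 11.73.
b = {"Intercept": -0.046, "C1": 0.412, "C2": 0.119,
     "Age": -0.484, "C1Age": 0.169, "C2Age": -0.351}

def group_line(c1, c2):
    """Return (intercept, slope) of volume versus Age for given indicator values."""
    intercept = b["Intercept"] + b["C1"] * c1 + b["C2"] * c2
    slope = b["Age"] + b["C1Age"] * c1 + b["C2Age"] * c2
    return round(intercept, 3), round(slope, 3)

# Assumption: the baseline group (C1 = 0, C2 = 0) is the database controls.
lines = {
    "database controls (C1=0, C2=0)": group_line(0, 0),
    "matched controls (C1=1, C2=0)": group_line(1, 0),
    "retired players (C1=0, C2=1)": group_line(0, 1),
}
for group, (b0, b1) in lines.items():
    print(f"{group}: volume = {b0} + ({b1}) * Age")
```

The interaction coefficients are the differences in slope from the baseline group, so their t tests address whether each group's slope differs from the baseline slope.
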

11.74 CEO pay and gross profits. In Exercise 10.50 (page 559), you assessed the relationship between the logarithm of a CEO’s pay ratio and the logarithm of the company’s gross profit per employee. These companies, however, are divided up into different industries. Is the relationship the same for each of these industries? The data set CNBC1 contains the centered values for log(Ratio) and log(Profit) plus three dummy variables and their interactions with the centered log(Ratio). Use them to compare the linear relationships and write a short paragraph of your findings.