Chapter 2 CHECK-IN QUESTIONS

  1. 2.1

    1. The cases are students.
    2. Number of friends and time (average amount spent on Facebook per week).
    3. Both variables are quantitative because both are numerical, and arithmetic operations are possible.
  2. 2.3 Cases: cups of Mocha Frappuccino. Variables: size and price (both quantitative).

  3. 2.5

    1. There are 52 cases.
    2. The variables are Price, Rating, and Type.
    3. Price and Rating are quantitative; Type is categorical.
  4. 2.7 Answers will vary.

  5. 2.9 The size in ounces is the explanatory variable, which should explain or cause changes in the cost. The scatterplot shows that there is a relationship: as ounces increase, so does the cost.

  6. 2.11 r=0.1547.

  7. 2.13

    1. The slope is 30.
    2. The intercept is 20.
    3. When x=0, y=20. When x=30, y=920. When x=60, y=1820.
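
The three values in 2.13(c) follow from substituting each x into the line. A minimal sketch, assuming the line is y = 20 + 30x as stated in parts (a) and (b); the function name `predict` is only illustrative:

```python
def predict(x, intercept=20.0, slope=30.0):
    """Return y on the straight line y = intercept + slope * x."""
    return intercept + slope * x

# Evaluate the line at the three x values used in the answer.
values = {x: predict(x) for x in (0, 30, 60)}
for x, y in values.items():
    print(f"x = {x:2d}  ->  y = {y:.0f}")
```
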
  8. 2.15 When a country’s MI is $35, its predicted IDI=4.96.

  9. 2.17 There are 7 points above the line and 7 points below the line.

  10. 2.19 The values of r² are given in the table below. As the correlation r moves away from 0 in either direction, the value of r² increases.

    r   -1.0  -0.8  -0.4  -0.2  0  0.2   0.4   0.8
    r²   1     0.64  0.16  0.04  0  0.04  0.16  0.64
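
The second row of the 2.19 table is just the square of the first. A short sketch, assuming the first four r values are negative (the squared values are the same either way):

```python
# Correlations from the table; the signs of the first four are an assumption.
r_values = [-1.0, -0.8, -0.4, -0.2, 0.0, 0.2, 0.4, 0.8]

# Squaring removes the sign, so r and -r give the same squared value.
r_squared = [round(r * r, 2) for r in r_values]

for r, r2 in zip(r_values, r_squared):
    print(f"r = {r:5.1f}   r^2 = {r2:.2f}")
```
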
  11. 2.21

    1. For x=28.7, y^=1.38017+2.82654(28.7)=82.50.
    2. Residual=12.8.
    3. Texas is further from the regression line because the residual is bigger in magnitude.
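
The prediction in 2.21(a) can be checked by direct substitution. This sketch uses only the coefficients and x value quoted above; the helper `residual` is illustrative and simply encodes the usual definition, observed minus predicted:

```python
intercept, slope = 1.38017, 2.82654
x = 28.7

# Predicted value on the least-squares line at x = 28.7.
y_hat = intercept + slope * x
print(f"predicted y = {y_hat:.2f}")

def residual(observed, predicted):
    """Residual = observed y minus predicted y."""
    return observed - predicted
```

A point with a residual of larger magnitude lies farther from the line in the vertical direction, which is the comparison made in part (c).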
  12. 2.23

    1. It is easy to identify the points because both the scatterplot and residual plot have the x variable, population, on the x axis.
  13. 2.25 There were 974 children aged 11 to 13. There were 1278 total children who met the requirement.

  14. 2.27 Each entry is the row total divided by the table total. For No, 751/2029=0.3701, or 37.01%. For Yes, 1278/2029=0.6299, or 62.99%.

  15. 2.29 Yes: 417/974=0.4281, or 42.81%. No: 557/974=0.5719, or 57.19%.
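
The percentages in 2.27 and 2.29 are counts divided by the appropriate total. A minimal sketch using only the counts quoted in those answers; the variable names are illustrative:

```python
# 2.27: marginal distribution (row total / table total).
no_total, yes_total = 751, 1278
table_total = no_total + yes_total          # 2029
marginal = {"No": no_total / table_total, "Yes": yes_total / table_total}

# 2.29: conditional distribution within the 11-to-13 age group.
yes_11_13, no_11_13 = 417, 557
group_total = yes_11_13 + no_11_13          # 974
conditional = {"Yes": yes_11_13 / group_total, "No": no_11_13 / group_total}

for label, dist in (("marginal", marginal), ("conditional", conditional)):
    for answer, p in dist.items():
        print(f"{label:11s} {answer:3s} {p:.4f}  ({100 * p:.2f}%)")
```
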

Chapter 2 EXERCISES

  1. 2.1

    1. Tweets.
    2. Click count, length of tweet, and outside temperature are quantitative. Day of week and rain are categorical. Time of day could be quantitative (in the hours/minutes format hh:mm) or categorical (if morning, afternoon, etc.).
    3. Click count is the response because the other variables could help explain the number of clicks. All of the others are potential explanatory variables because they can be controlled by the researcher or are already set.
  2. 2.3 Answers will vary. Some possible variables are price, type of textbook, major, third or fourth year course, etc.

  3. 2.5 Answers will vary. Some possible variables are university, size, etc., in addition to the average number of tickets sold and the percentage of games won. Cases would be individual teams. Here, we would likely be interested in whether there is a relationship between the average number of tickets sold and the percentage of games won.

  4. 2.7 Answers will vary. This could include enrollment, graduation rate, job placement rate, in-state tuition, out-of-state tuition, public/private institution, etc.

  5. 2.9

    1. The form is somewhat linear; the direction is positive; the strength is still weak.
    2. There are a few possible low outliers for Antho3.
    3. Adding a line could be very useful because the relationship is somewhat linear.
    4. Smoothing does not contribute much in describing the relationship.
  6. 2.11

    1. The form is linear; the direction is positive; the strength is very strong.
    2. There is one outlier.
    3. Yes, the line shows the direction and strength.
    4. Smoothing does not help at all because the relationship is quite linear.
  7. 2.13

    1. For all fuel types, as highway fuel consumption increases, so do carbon dioxide emissions.
    2. Vehicles with fuel type D have the largest emissions, while vehicles with fuel type E have the smallest emissions. The other two types, X and Z, have fairly similar emissions.
  8. 2.15

    1. As nondominant arm strength increases, so does dominant arm strength. There is one outlier with an extremely high nondominant arm strength.
    2. Linear; positive; strong.
    3. There is one outlier.
    4. Yes, the relationship is linear except for the outlier.
  9. 2.17 Answers will vary. The explanatory variable is the major. The response variable is graduating in four years. Both variables are categorical, therefore the methods described in this section cannot be used.

  10. 2.19

    1. Smoothing does not help.
  11. 2.21 The relationships between calories and alcohol content are quite similar for domestic and imported beers. Also, the outlier among the imported beers is no longer an outlier because several domestic beers have a similar alcohol content.

  12. 2.23

    1. As time increases, log count goes down.
    2. Linear; negative; strong.
    3. There are no outliers.
    4. The relationship is very linear.
  13. 2.25

    1. The plot is more linear than the original scatterplot.
    2. The log transformed data should be preferred because it straightens out the relationship.
  14. 2.27

    1. The association is positive, linear, and strong.
    2. Overall the relationship is strong, but it is stronger for women than for men. Male subjects generally have both greater lean body mass and higher metabolic rates than women.
  15. 2.29

    1. No linear relationship.
    2. Strong and positive.
    3. Strong and negative.
    4. Weak and negative.
  16. 2.31

    1. r=0.298.
    2. Probably; the plot is somewhat linear.
    3. No, if they were approximately equal, the correlation would be closer to 1.
  17. 2.33 For fuel type D: r=0.9707; for fuel type E: r=0.9635; for fuel type X: r=0.9746; for fuel type Z: r=0.9658. For all four fuel types: the correlations are around 0.97, and the relationship between CO2 emissions and highway fuel consumption is linear and very strong.

  18. 2.35

    1. r=0.905.
    2. Yes, because the pattern is very linear. There is one outlier, but it fits the overall pattern.
  19. 2.37 There is little linear association between research and teaching; for example, knowing that a professor is a good researcher gives little information about whether she or he is a good or bad teacher.

  20. 2.39

    1. r=0.999.
    2. Yes, because the scatterplot is very strongly linear.
    3. You must be careful; there can be a strong correlation between two variables even when the relationship is curved. Plot the data first!
  21. 2.41

    1. r=0.9051.
    2. Yes, because it is quite linear; however, there is one outlier in this data set, O’Doul’s, with an extremely low alcohol percent.
  22. 2.43 Both correlations for the imported and domestic beers are quite similar, especially when the outlier O’Doul’s is removed. The relationships between calories and percent alcohol for both types of beers are linear and very strong and quite similar in pattern.

  23. 2.45

    1. With only two points, the correlation will be 1 or -1, because they form a perfect straight line.
    • (b–d) Answers will vary.
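
The two-point case in 2.45(a) is easy to verify numerically. The sketch below computes r from its definition; the two pairs of points are made up purely for illustration:

```python
import math

def correlation(xs, ys):
    """Pearson correlation r computed from deviations about the means."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

# Hypothetical points: one pair rising left to right, one pair falling.
rising = correlation([1, 4], [2, 3])
falling = correlation([1, 4], [3, 2])
print(rising, falling)
```

The function is undefined when the two points share an x or y value, because one of the sums of squares is then 0.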
  24. 2.47

    1. r=0.73011.
    2. The correlation is not a good numerical summary for this relationship because there is a curvature in the plot.
  25. 2.49

    1. y^=0.8302+0.1755Antho4.
    1. The line does not fit the data well.
    2. y^=1.12851.
  26. 2.51

    1. y^=10.494+29.066FuelConsHwy.
    1. A single regression line would not be a good fit for the four types of vehicles, even though the correlations were all very close. There are several different lines that need to be accounted for.
  27. 2.53

    1. There is a strong linear relationship. For each unit increase in nondominant, the dominant arm bone strength increases by 1.373.
  28. 2.55 Predicted bone strength is 22.854 cm⁴/1000.

  29. 2.57

    • (a) – (c)

      Count = 602.8 - (74.7 × time)

      Time  Count  Predicted (a)  Difference (b)  Squared difference (c)
      1     578    528.1            49.9           2490.01
      3     317    378.7           -61.7           3806.89
      5     203    229.3           -26.3            691.69
      7     118     79.9            38.1           1451.61
    1. Count = 500 - (100 × time)

      Time  Predicted  Difference  Squared difference
      1      400       178          31,684
      3      200       117          13,689
      5        0       203          41,209
      7     -200       318         101,124

    2. The first line is a better description of the relationship: its squared differences sum to 8440.20, compared with 187,706 for the second line.
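
The comparison in 2.57 rests on the sum of squared differences between observed and predicted counts. A minimal sketch using the four (time, count) pairs from the table and assuming both lines have negative slopes, as the predicted values imply:

```python
times = [1, 3, 5, 7]
counts = [578, 317, 203, 118]

def sum_squared_differences(intercept, slope):
    """Sum of (observed - predicted)^2 for the line intercept + slope * time."""
    return sum((c - (intercept + slope * t)) ** 2 for t, c in zip(times, counts))

sse_first = sum_squared_differences(602.8, -74.7)   # Count = 602.8 - 74.7 * time
sse_second = sum_squared_differences(500, -100)     # Count = 500 - 100 * time

print(f"first line:  {sse_first:.2f}")
print(f"second line: {sse_second:.2f}")
```

The line with the smaller sum is the better description by the least-squares criterion.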

  30. 2.59

    1. The relationship is linear, positive, and strong. There are several outliers.
    2. y^=15057+0.05326x.
  31. 2.61

    1. y^=197,993.
    2. y^=202,327.
    3. The outliers didn’t change the prediction for the median-sized state.
  32. 2.63

    1. y^=8491.907+0.048x.
    2. r²=0.942.
    3. 94.2% of the variation in the number of undergraduates is accounted for by the population size.
    4. The software does not report the nature of the relationship.
  33. 2.65

    1. There is a weak negative linear relationship but with one extreme outlier.
    2. y^=31.37-0.0867x.
    3. r²=0.0179.
    4. The x variable accounts for 1.79% of the variation in y.
  34. 2.67

    1. y^=5.7709+2858.2x.
    2. r²=0.8193.
    3. The relationship between calories and percent alcohol is linear, positive, and strong; however, there does seem to be one low outlier, O’Doul’s, with a very low alcohol content.
  35. 2.69

    1. The correlations and regression lines for all four data sets are essentially the same: r=0.82 and y^=3+0.5x. For x=10, y^=8.
    1. Only for Data Set A should regression be used.
  36. 2.71

    1. y=1.
    2. y increases by 8.
    3. The intercept is 25.
  37. 2.73 y^=0.896x-1517.935. r=0.982. We can conclude that NAEP scores are steadily increasing by about 0.896 point per year.

  38. 2.75 r=0.5.

  39. 2.77 The residuals are 10.58, 0.13, 7.24, 0.14.

  40. 2.79 10.0, extrapolation; 13.0, 16.0, 19.0, 30.0, prediction.

  41. 2.81

    • (a) – (b)

      Time  LogCount  Predicted  Residual
      1     6.35957   6.33244     0.02713
      3     5.75890   5.81121    -0.05231
      5     5.31321   5.28997     0.02324
      7     4.77068   4.76874     0.00195
    1. The residual plot looks random; the model using logs is much better.
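
The residuals in 2.81 are observed log counts minus predicted log counts. A short sketch using the rounded values printed in the table; the last residual comes out as 0.00194 rather than 0.00195, presumably because the table was computed from unrounded predictions:

```python
log_counts = [6.35957, 5.75890, 5.31321, 4.77068]
predicted = [6.33244, 5.81121, 5.28997, 4.76874]

# Residual = observed - predicted, rounded to the table's five decimals.
residuals = [round(obs - pred, 5) for obs, pred in zip(log_counts, predicted)]
print(residuals)

# Least-squares residuals should sum to (approximately) zero.
print(f"sum of residuals: {sum(residuals):.5f}")
```
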
  42. 2.83

    1. No.
    2. No.
    3. No.
    4. California is not an outlier and does not influence the regression line using the log transformations.
  43. 2.85

    1. This is not extrapolation because 13.5 is within the range of x.
    2. An influential observation can have a small residual.
    3. Correlation does not imply causation.
    4. The residual is 5.
  44. 2.87 Internet use does not cause people to have fewer babies. Possible lurking variables are economic status of the country, levels of education, etc.

  45. 2.89 For example, a reasonable explanation is that the cause-and-effect relationship goes in the other direction: doing well makes students or workers feel good about themselves rather than vice versa.

  46. 2.91 The explanatory and response variables were “consumption of herbal tea” and “cheerfulness/health.” The most important lurking variable is social interaction; many of the nursing-home residents may have been lonely before the students started visiting.

  47. 2.93

    1. It is difficult to draw the correct line by hand.
    2. Most people tend to overestimate the slope for a scatterplot with r=0.6; that is, most students will find that the least-squares line (the one without the ending dots) is less steep than the one they draw.
  48. 2.95 Each group has a positive association, but when combined, the regression slope is negative.

  49. 2.97

    1. Because they want to see the effect of driver’s education courses, that is the explanatory variable. The number of accidents is the response.
    2. Driver’s Ed would be the column (x) variable, and number of accidents would be the row (y) variable.
    3. There are six cells (two columns by three rows). For example, the first row, first column entry could be the number who took driver’s education and had 0 accidents.
  50. 2.99

    1. Age is the explanatory variable. Rejected is the response. With the dentistry available at that time, it’s reasonable to think that as a person got older, he would have lost more teeth.
    2. Joint distribution of Rejected and Age

             Under 20  20 to 25  25 to 30  30 to 35  35 to 40  Over 40
      Yes    0.0002    0.0019    0.0033    0.0053    0.0086    0.0114
      No     0.1761    0.2333    0.1663    0.1316    0.1423    0.1196

    3. Marginal distribution of Rejected

      Yes      No
      0.03081  0.96919

      Marginal distribution of Age

      Under 20  20 to 25  25 to 30  30 to 35  35 to 40  Over 40
      0.1763    0.2352    0.1696    0.1369    0.1509    0.131

    4. The conditional distribution of Rejected given Age, because we have said Age is the explanatory variable.

    5. In the table, note that all columns sum to 1. We can clearly see the proportion of rejected recruits increasing with increasing age.

             Under 20  20 to 25  25 to 30  30 to 35  35 to 40  Over 40
      Yes    0.0012    0.0082    0.0196    0.0389    0.0572    0.0868
      No     0.9988    0.9918    0.9804    0.9611    0.9428    0.9132
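
The conditional distribution in 2.99 can be approximated from the joint proportions by dividing each cell by its column (age) total. This is a sketch only: because the joint table is rounded to four decimals, the results differ slightly in the third or fourth decimal from the values printed above, which were presumably computed from the raw counts.

```python
ages = ["Under 20", "20 to 25", "25 to 30", "30 to 35", "35 to 40", "Over 40"]
joint_yes = [0.0002, 0.0019, 0.0033, 0.0053, 0.0086, 0.0114]
joint_no = [0.1761, 0.2333, 0.1663, 0.1316, 0.1423, 0.1196]

yes_given_age, no_given_age = [], []
for yes, no in zip(joint_yes, joint_no):
    age_share = yes + no                 # marginal proportion for this age group
    yes_given_age.append(yes / age_share)
    no_given_age.append(no / age_share)

for age, p_yes, p_no in zip(ages, yes_given_age, no_given_age):
    print(f"{age:>8}:  Yes {p_yes:.4f}   No {p_no:.4f}")
```

Each column sums to 1, and the proportion rejected rises with every age group.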
  51. 2.101 Sex is the explanatory variable, and Lied is the response variable. For the males, about 55% admitted that they had lied, whereas for the females, 51% admitted that they had lied. Males may be slightly more willing to admit that they lied than females.

  52. 2.103

    1. 50.5% get enough sleep, and 49.5% do not.
    2. 32.2% get enough sleep, and 67.8% do not.
    3. Those who exercise more than the median are more likely to get enough sleep.
  53. 2.105 3.0% of Hospital A’s patients died, compared with 2.0% at B.

  54. 2.107 In general, choose a to be any number from 0 to 300, and then all the other entries can be determined.

  55. 2.109 For example, causation might be a negative association between the setting on a stove and the time required to boil a pot of water (higher setting, less time). Common response might be a positive association between SAT score and grade point average. Both of these will have a positive relationship with a person’s IQ. An example of confounding might be a negative association between hours of TV watching and grade point average. Once again, people who are naturally smart could finish required work faster and have more time for TV; those who aren’t as smart could become frustrated and watch TV instead of doing homework.

  56. 2.111 This is a case of confounding: the association between dietary iron and anemia is difficult to detect because malaria and helminths also affect iron levels in the body.

  57. 2.113 Responses will vary. For example, students who choose the online course might have more self-motivation or better computer skills. The generic “Student characteristics” might be replaced with something more specific.

  58. 2.115 No; self-confidence and improving fitness could be common responses to some other personality trait, or high self-confidence could make a person more likely to join the exercise program.

  59. 2.117 Patients suffering from more serious illnesses are more likely to go to larger hospitals (which may have more or better facilities) for treatment. They are also likely to require more time to recuperate afterward.

  60. 2.119 People who are overweight are more likely to be on diets and so choose artificial sweeteners over sugar.

  61. 2.121 This is an observational study: students choose their “treatment” (to take or not take the refresher sessions).

  62. 2.123

    1. There is no linear relationship.
    2. y^=105.16+0.0193DwellPermit.
    3. For each new index point of dwelling permits issued, production increases by 0.0193.
    4. 105.16; this is what we would expect sales to be when the index for permits issued for new dwellings is 0.
    5. y^=107.43.
    6. e=10.37.
    7. r²=2.97%.
  63. 2.125

    1. As the percent under 15 increases, the percent of the population over 65 decreases.
    2. r=-0.91. The correlation gives a pretty good representation of the relationship; however, there is an outlier, Nunavut.
  64. 2.127

    1. The three territories have a smaller population than any of the provinces. Additionally, two of the three territories have larger percentages of the population under 15 than any of the provinces.
  65. 2.129

    1. The marginal totals are SBL: 1688; SME: 911; AH: 801; Ed: 319; Other: 857. By country, Canada: 176; France: 672; Germany: 218; Italy: 321; Japan: 645; UK: 475; U.S.: 2069.
    2. Canada: 3.85%; France: 14.7%; Germany: 4.8%; Italy: 7%; Japan: 14.1%; UK: 10.4%; U.S.: 45.2%.
    3. SBL: 36.9%; SME: 19.9%; AH: 17.5%; Ed: 7.0%; Other: 18.7%.
  66. 2.131 A school that accepts weaker students but graduates a higher-than-expected number of them would have a positive residual, whereas a school with a stronger incoming class but a lower-than-expected graduation rate would have a negative residual. It seems reasonable to measure school quality by how much benefit students receive from attending the school.

  67. 2.133

    1. The residuals are positive at the beginning and end, and they are negative in the middle.
    2. The behavior of the residuals agrees with the curved relationship shown in Figure 2.38.
  68. 2.135

    1. y^=41.253+3.9331Year; for year 26, the predicted salary is 143.51, or about $143,513.
    2. Using logs: y^=3.8675+0.04832Year. At Year 26, we predict 5.1237, or about $167,956.
    3. Although both predictions involve extrapolation, the second is more reliable because it is based on a linear fit to a linear relationship.
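
Both predictions in 2.135 can be reproduced from the fitted equations. The sketch assumes salary is in thousands of dollars and that the log model uses natural logarithms, which is consistent with 5.1237 mapping to about $167,956:

```python
import math

year = 26

# (a) Straight-line fit on the original scale (thousands of dollars).
linear_prediction = 41.253 + 3.9331 * year

# (b) Straight-line fit to log(salary); back-transform with exp (assumed natural log).
log_prediction = 3.8675 + 0.04832 * year
salary_from_log = math.exp(log_prediction)

# Back-transforming the rounded value quoted in the answer.
salary_from_rounded_log = math.exp(5.1237)

print(f"linear model: {linear_prediction:.2f}")
print(f"log model:    log = {log_prediction:.4f}, salary = {salary_from_log:.2f}")
print(f"exp(5.1237) = {salary_from_rounded_log:.3f}")
```

Small differences from the quoted dollar amounts come from rounding in the printed coefficients.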
    4. Interpreting relationships without a plot is risky.
  69. 2.137

    1. y^=3009.4+0.9996Salary2019_20.
    2. There are two outliers in the y direction.
  70. 2.139 Number of firefighters and amount of damage are common responses to the seriousness of the fire.

  71. 2.141 There is a strong positive linear association. y^=2652.66+0.00474MOE; r²=0.6217. We can use MOE to get fairly good predictions of MOR.

  72. 2.143

    1. Table shown here.
    2. Males: 70% admitted. Females: 56% admitted.
    3. Business: 80% of males and 90% of females. For Law, 10% of males and 33% of females.
    4. 75% of men apply to business school, where admission is easier. More women apply to law school, which is more selective.

      Sex     Admit  Deny  Total
      Male    490    310    800
      Female  400    300    700
      Total   890    610   1500
  73. 2.145 If we ignore “Year,” Department A teaches 61.54% small classes, and Department B teaches 39.62% small classes. However, in upper-level classes, A has 77.5% and B has 83.33% small classes. Additionally, 76.92% of A’s classes are upper-level courses, compared to 33.96% of B’s classes.

  74. 2.147

    1. People have mixed feelings on the quality of the recycled filters.
    2. 55.6% of buyers think the quality is higher, while 44.3% of nonbuyers think the quality is lower. It is plausible that using the filters may cause more favorable opinions.
  75. 2.149 Answers will vary.

  76. 2.151

    1. The tables are shown here.

      Female Titanic passengers

                 Class
                 1    2    3    Total
      Survived   139   94  106   339
      Died         5   12  110   127
      Total      144  106  216   466

      Male Titanic passengers

                 Class
                 1    2    3    Total
      Survived    61   25   75   161
      Died       118  146  418   682
      Total      179  171  493   843
    2. Looking at the conditional distribution of survival given class for females, 96.53% of first-class females survived, compared with 88.68% of second-class females and 49.07% of third-class females. Survival depended on class.

    3. For males, 34.08% of first-class passengers survived, compared with 14.62% of second-class and 15.21% of third-class passengers. Once again, survival depended on class.

    4. Females overall had much higher survival rates than males.
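
The survival percentages in 2.151 are conditional proportions: survivors in a class divided by all passengers of that sex in that class. A minimal sketch using the counts from the two tables; the dictionary layout is just one convenient arrangement:

```python
# Counts by class (1, 2, 3) taken from the two tables.
survived = {"Female": [139, 94, 106], "Male": [61, 25, 75]}
died = {"Female": [5, 12, 110], "Male": [118, 146, 418]}

rates = {}    # percent surviving, by sex and class
overall = {}  # percent surviving, by sex
for sex in survived:
    rates[sex] = [
        round(100 * s / (s + d), 2) for s, d in zip(survived[sex], died[sex])
    ]
    total_survived = sum(survived[sex])
    total = total_survived + sum(died[sex])
    overall[sex] = round(100 * total_survived / total, 2)

for sex in rates:
    print(sex, rates[sex], "overall:", overall[sex])
```

The overall rates (about 72.75% for females and 19.10% for males) quantify the comparison in part (d).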

  77. 2.153 Answers will vary.