Chapter 5 EXERCISES

5.44 The cost of Internet access. In Canada, households spent an average of $54.17 CDN monthly for high-speed Internet access.²⁴ Assume that the standard deviation is $17.83. If you ask an SRS of 500 Canadian households with high-speed Internet how much they pay, what is the probability that the average amount will exceed $55?
5.45 Dust in coal mines. A laboratory weighs filters from a coal mine to measure the amount of dust in the mine atmosphere. Repeated measurements of the weight of dust on the same filter vary Normally, with standard deviation σ=0.11 milligram (mg) because the weighing is not perfectly precise. The dust on a particular filter actually weighs 137 mg.
1. The laboratory reports the mean of three weighings of this filter. What is the distribution of this mean?
2. What is the probability that the laboratory will report a weight of 137.12 mg or higher for this filter?
5.46 The effect of sample size on the standard deviation. Assume that the standard deviation in a very large population is 100.
1. Calculate the standard deviation for the sample mean for samples of size 1, 4, 25, 100, 250, 500, 1000, and 5000.
2. Graph your results with the sample size on the x axis and the standard deviation on the y axis.
3. Summarize the relationship between the sample size and the standard deviation that your graph shows.
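The calculation in Exercise 5.46 can be sketched in a few lines of Python. This is only an illustrative sketch: the helper name `sd_of_sample_mean` is made up here, and it assumes independent observations so that the standard deviation of the sample mean is σ/√n.

```python
import math

def sd_of_sample_mean(sigma: float, n: int) -> float:
    """Standard deviation of the mean of n independent observations: sigma / sqrt(n)."""
    return sigma / math.sqrt(n)

sigma = 100.0  # population standard deviation given in the exercise
sizes = [1, 4, 25, 100, 250, 500, 1000, 5000]

for n in sizes:
    print(f"n = {n:>4}: sd of sample mean = {sd_of_sample_mean(sigma, n):.2f}")
```

The printed pairs can be plotted with any graphing tool to answer part 2.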
5.47 Monitoring the emerald ash borer. The emerald ash borer is a beetle that poses a serious threat to ash trees. Purple traps are often used to detect or monitor populations of this pest. In the counties of your state where the beetle is present, thousands of traps are used to monitor the population. These traps are checked periodically. The distribution of beetle counts per trap is discrete and strongly skewed. A majority of traps have no beetles, and only a few will have more than two beetles. For this exercise, assume that the mean number of beetles trapped is 0.43, with a standard deviation of 0.95.
1. Suppose that your state does not have the resources to check all the traps, so it plans to check only an SRS of n=150 traps. What are the mean and standard deviation of the average number of beetles x̄ in 150 traps?
2. Use the central limit theorem to find the probability that the average number of beetles in 150 traps is greater than 0.55.
3. Do you think it is appropriate in this situation to use the central limit theorem? Explain your answer.
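A central limit theorem calculation like the one in Exercise 5.47 might be sketched as follows. The function name `prob_mean_exceeds` is hypothetical, and the sketch assumes the Normal approximation is adequate, which is exactly what part 3 asks you to judge.

```python
import math
from statistics import NormalDist

def prob_mean_exceeds(mu: float, sigma: float, n: int, threshold: float) -> float:
    """Normal (CLT) approximation to P(sample mean > threshold)."""
    sampling_dist = NormalDist(mu, sigma / math.sqrt(n))
    return 1.0 - sampling_dist.cdf(threshold)

# Values from Exercise 5.47: mean 0.43, standard deviation 0.95, n = 150 traps
p_beetles = prob_mean_exceeds(mu=0.43, sigma=0.95, n=150, threshold=0.55)
print(f"Approximate P(mean count > 0.55) = {p_beetles:.4f}")
```

The same helper applies to any exercise in this set that asks for the probability that a sample mean exceeds a cutoff, provided the population mean and standard deviation are given.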
5.48 Attitudes toward drinking and studies of behavior. Some of the methods in this chapter are based on approximations rather than exact probability results. We have given rules of thumb for safe use of these approximations.
1. You are interested in attitudes toward drinking among the 75 members of a fraternity. You choose 30 members at random to interview. One question is “Have you had five or more drinks at one time during the past week?” Suppose that, in fact, 30% of the 75 members would say Yes. Explain why you cannot safely use the B(30, 0.3) distribution for the count X in your sample who say Yes.
2. The National AIDS Behavioral Surveys found that 0.2% (that’s 0.002 as a decimal fraction) of adult heterosexuals had both received a blood transfusion and had a sexual partner from a group at high risk of AIDS. Suppose that this national proportion holds for your region. Explain why you cannot safely use the Normal approximation for the sample proportion who fall in this group when you interview an SRS of 1000 adults.

5.49 Benford’s law. It is a striking fact that the first digits of numbers in legitimate records often follow a distribution known as Benford’s law. Here it is:

First digit	1	2	3	4	5	6	7	8	9
Proportion	0.301	0.176	0.125	0.097	0.079	0.067	0.058	0.051	0.046

Fake records usually have fewer first digits 1, 2, and 3. What is the approximate probability, if Benford’s law holds, that among 1000 randomly chosen invoices there are 575 or fewer amounts with first digit 1, 2, or 3?
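One way to set up the Benford's law calculation is to treat the count of invoices with first digit 1, 2, or 3 as binomial and apply the Normal approximation. The sketch below assumes that approach; the helper name is illustrative and no continuity correction is applied.

```python
import math
from statistics import NormalDist

def normal_approx_binomial_cdf(k: float, n: int, p: float) -> float:
    """Normal approximation to P(X <= k) for X ~ B(n, p), no continuity correction."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    return NormalDist(mean, sd).cdf(k)

# P(first digit is 1, 2, or 3), summed from the Benford's law table
p_low_digit = 0.301 + 0.176 + 0.125

approx = normal_approx_binomial_cdf(575, 1000, p_low_digit)
print(f"p = {p_low_digit:.3f}, approximate P(X <= 575) = {approx:.4f}")
```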

5.50 Watching live television. A survey of 442 people aged 18 to 29 revealed that 30% watch live television every day.²⁵ You take a random sample of 20 undergraduates from your university and ask them whether they watch live TV every day. If their rate matches the 30% rate:
1. What is the distribution of the number of students who say they watch live television every day?
2. What is the distribution of the number of students who say that they do not watch live television every day?
3. What is the probability that no more than 2 of the 20 students in your sample say that they watch live television every day?
5.51 Leaking gas tanks. Leakage from underground gasoline tanks at service stations can damage the environment. It is estimated that 25% of these tanks leak. You examine 15 tanks chosen at random, independently of each other.
1. What is the mean number of leaking tanks in such a sample of 15?
2. What is the probability that 10 or more of the 15 tanks leak?
3. Now you do a larger study, examining a random sample of 2000 tanks nationally. What is the probability that at least 540 of these tanks are leaking?
5.52 Watching live television, continued. Refer to Exercise 5.50. You think that the undergraduate rate of those who watch live television every day at your university is 15%.
1. Using this rate, what is the expected number of students in your sample who say that they watch live television every day? What is the expected number of students who say that they do not watch live television every day? You should see that these two means add to 20, the total number of students.
2. What is the probability that no more than 2 of the 20 students in your sample say that they watch live television every day?
3. Based on your answer to part (b) and your answer to part (c) of Exercise 5.50, which of the two rates (30% or 15%) is more supported by an observed count X=2? Explain your answer.
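Exact binomial probabilities such as those in Exercises 5.50 and 5.52 can be checked without tables. The following is a minimal sketch using only the binomial formula; the function names are illustrative.

```python
from math import comb

def binomial_pmf(k: int, n: int, p: float) -> float:
    """P(X = k) for X ~ B(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)

def binomial_cdf(k: int, n: int, p: float) -> float:
    """P(X <= k) for X ~ B(n, p), summing the pmf from 0 through k."""
    return sum(binomial_pmf(j, n, p) for j in range(k + 1))

# Compare the two candidate rates for a sample of 20 students
for rate in (0.30, 0.15):
    print(f"p = {rate:.2f}: P(X <= 2) = {binomial_cdf(2, 20, rate):.4f}")
```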
5.53 Marks per round in cricket. Cricket is a dart game that uses the numbers 15 to 20 and the bull’s-eye. Each time you hit one of these regions, you score either 0, 1, 2, or 3 marks. Thus, in a round of three throws, a person can score 0 to 9 marks. Lex plans to play 20 games. Her distribution of marks per round is discrete and strongly skewed. A majority of her rounds result in 0, 1, or 2 marks, and only a few are more than 4 marks. Assume that her mean is 2.21 marks per round, with a standard deviation of 1.90, and that her 20 games will involve 140 rounds.
1. What are the mean and standard deviation of the average number of marks x̄ in 140 rounds?
2. Using the central limit theorem, what is the probability that Lex averages fewer than 2 marks per round?
3. Do you think that the probability obtained in part (b) is a good approximation to the true probability in this setting? Explain your answer.
5.54 Common last names. The U.S. Census Bureau says that the 10 most common names in the United States are (in order) Smith, Johnson, Williams, Brown, Jones, Garcia, Miller, Davis, Rodriguez, and Martinez.²⁶ These names account for 4.9% of all U.S. residents. Out of curiosity, you look at the authors of the textbooks for your current courses. There are 12 authors in all. Would you be surprised if none of the names of these authors were among the 10 most common? Give a probability to support your answer and explain the reasoning behind your calculation.
5.55 Use the Normal approximation. Suppose that we toss a fair coin. Use the Normal approximation to find the probability that the sample proportion of heads is
1. between 0.45 and 0.55 when n=100.
2. between 0.48 and 0.52 when n=625.
3. Use these results to describe the relationship between the sample size and the precision of the estimate p̂.
5.56 Use the Probability applet. The Probability applet simulates tosses of a coin. You can choose the number of tosses n and the probability p of a head. You can therefore use the applet to simulate binomial random variables.

The count of misclassified sales records in Example 5.20 has the binomial distribution with n=15 and p=0.08. Set these values for the number of tosses and probability of heads in the applet. Table C shows that the probability of getting a sample with exactly 0 misclassified records is 0.2863. This is the long-run proportion of samples with no bad records. Toss and reset repeatedly to simulate 25 samples of 15 tosses. Record the number of bad records (the count of heads) in each of the 25 samples.
1. What proportion of the 25 samples had exactly 0 bad records? Do you think this sample proportion is close to the probability?
2. Remember that this probability of 0.2863 tells us only what happens in the long run. Here we’re considering only 25 samples. If X is the number of samples out of 25 with exactly 0 misclassified records, what is the distribution of X?
3. Explain how to use the distribution in part (b) to describe the sampling distribution of p̂ in part (a).
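If the Probability applet is not available, the experiment in Exercise 5.56 can be imitated with a short simulation. This is a stand-in sketch, not the applet itself: the function name and the seed are arbitrary choices, and a different seed gives a different set of 25 samples.

```python
import random

def simulate_counts(num_samples: int, n: int, p: float, seed=None) -> list:
    """Simulate num_samples binomial counts, each the number of heads in n tosses."""
    rng = random.Random(seed)
    return [sum(rng.random() < p for _ in range(n)) for _ in range(num_samples)]

# 25 samples of 15 tosses with P(head) = 0.08, as in the exercise
counts = simulate_counts(num_samples=25, n=15, p=0.08, seed=1)
prop_zero = counts.count(0) / len(counts)

# Long-run probability of exactly 0 bad records, for comparison
exact_zero = (1 - 0.08) ** 15

print("counts:", counts)
print(f"proportion of samples with 0 bad records: {prop_zero:.2f}")
print(f"exact probability: {exact_zero:.4f}")
```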
5.57 A random walk. A particle moves along the line in a random walk. That is, the particle starts at the origin (position 0) and moves either right or left in independent steps of length 1. If the particle moves to the right with probability 0.6, its movement at the ith step is a random variable Xi with distribution

P(Xi=1)=0.6
P(Xi=-1)=0.4

The position of the particle after k steps is the sum of these random movements,

Y=X1+X2+⋯+Xk

Use the central limit theorem to find the approximate probability that the position of the particle after 500 steps is at least 200 to the right.
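The random walk calculation can be organized as a sketch like the one here: find the mean and variance of a single step from its distribution, scale them up to k independent steps, and apply the Normal approximation. Variable names are illustrative.

```python
import math
from statistics import NormalDist

# Distribution of a single step Xi: value -> probability
step_dist = {1: 0.6, -1: 0.4}

step_mean = sum(x * prob for x, prob in step_dist.items())
step_var = sum((x - step_mean) ** 2 * prob for x, prob in step_dist.items())

k = 500  # number of independent steps
# Means and variances of independent steps add, so Y has mean k*mu and variance k*sigma^2
position = NormalDist(k * step_mean, math.sqrt(k * step_var))

prob_far_right = 1.0 - position.cdf(200)
print(f"step mean = {step_mean:.2f}, step variance = {step_var:.2f}")
print(f"approximate P(Y >= 200) = {prob_far_right:.2e}")
```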
5.58 Tossing a die. You are tossing a balanced die that has probability 1/6 of coming up 1 on each toss. Tosses are independent. We are interested in how long we must wait to get the first 1.
1. The probability of a 1 on the first toss is 1/6. What is the probability that the first toss is not a 1 and the second toss is a 1?
2. What is the probability that the first two tosses are not 1s and the third toss is a 1? This is the probability that the first 1 occurs on the third toss.
3. Now you see the pattern. What is the probability that the first 1 occurs on the fourth toss? On the fifth toss?
5.59 The geometric distribution. Generalize your work in the previous exercise. You have independent trials, each resulting in a success or a failure. The probability of a success is p on each trial. The binomial distribution describes the count of successes in a fixed number of trials. Now the number of trials is not fixed; instead, continue until you get a success. The random variable Y is the number of the trial on which the first success occurs. What are the possible values of Y? What is the probability P(Y=k) for any of these values?
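A simulation offers one way to check a proposed formula for P(Y=k). The sketch below assumes the candidate formula (1-p)^(k-1) p suggested by the pattern in Exercise 5.58 and compares it with simulated waiting times for a balanced die; the function names and the seed are arbitrary.

```python
import random

def geometric_pmf(k: int, p: float) -> float:
    """Candidate formula: k - 1 failures followed by one success."""
    return (1 - p) ** (k - 1) * p

def first_success_trial(rng: random.Random, p: float) -> int:
    """Run independent trials until the first success; return that trial's number."""
    trial = 1
    while rng.random() >= p:  # failure occurs with probability 1 - p
        trial += 1
    return trial

rng = random.Random(0)
p = 1 / 6  # probability of a 1 on a balanced die
draws = [first_success_trial(rng, p) for _ in range(20000)]

for k in range(1, 6):
    observed = draws.count(k) / len(draws)
    print(f"k = {k}: simulated {observed:.4f}, formula {geometric_pmf(k, p):.4f}")
```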
5.60 Wi-Fi interruptions. Suppose that the number of Wi-Fi interruptions on your home network follows the Poisson distribution, with an average of 1.6 Wi-Fi interruptions per day.
1. Show that the probability of no interruptions on a given day is 0.2019.
2. Treating each day as a trial in a binomial setting, use the binomial formula to compute the probability of no interruptions in a week.
3. Now, instead of using the binomial model, let’s use the Poisson distribution exclusively. What is the mean number of Wi-Fi interruptions during a week?
4. Based on the Poisson mean of part (c), use the Poisson distribution to compute the probability of no interruptions in a week. Confirm that this probability is the same as that found in part (b). Explain in words why the two ways of computing the probability of no interruptions in a week give the same result.
5. Explain why using the binomial distribution to compute the probability that exactly one day in the week has an interruption does not give the same probability as using the Poisson distribution to compute the probability that exactly one interruption occurs during the week.
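The two routes to the weekly probability in Exercise 5.60 can be compared numerically. This sketch assumes a 7-day week with independent days; `poisson_pmf` is an illustrative helper written from the Poisson formula.

```python
import math

def poisson_pmf(k: int, mean: float) -> float:
    """P(X = k) for a Poisson random variable with the given mean."""
    return math.exp(-mean) * mean**k / math.factorial(k)

daily_mean = 1.6
days = 7

# Daily probability of no interruptions
p_day = poisson_pmf(0, daily_mean)

# Binomial route: all 7 of 7 independent days are interruption free
p_week_binomial = p_day**days

# Poisson route: the weekly count is Poisson with mean 7 * 1.6
p_week_poisson = poisson_pmf(0, days * daily_mean)

print(f"P(no interruptions in a day) = {p_day:.4f}")
print(f"week, binomial route = {p_week_binomial:.3e}")
print(f"week, Poisson route  = {p_week_poisson:.3e}")
```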
5.61 Poisson distribution? Suppose you find in your spam folder an average of two spam emails every 10 minutes. Furthermore, you find that the rate of spam mail from midnight to 6 a.m. is twice the rate during other parts of the day. Explain whether or not the Poisson distribution is an appropriate model for the spam process.

5.62 A lottery payoff. A $1 bet in a state lottery’s Pick 3 game pays $500 if the three-digit number you choose exactly matches the winning number, which is drawn at random. Here is the distribution of the payoff X:

Payoff X	$0	$500
Probability	0.999	0.001

Each day’s drawing is independent of other drawings.

1. Joe buys a Pick 3 ticket twice a week. The number of times he wins in a year follows a B(104, 0.001) distribution. Using the Poisson approximation to the binomial, what is the probability that he wins at least once?
2. The exact binomial probability is 0.0988. How accurate is the Poisson approximation here?
3. If Joe pays $5 a ticket, he needs to win at least twice a year to come out ahead. Using the Poisson approximation, what is the probability that Joe will come out ahead?
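The Poisson approximation in Exercise 5.62 replaces B(n, p) with a Poisson distribution whose mean is np. A minimal sketch of the comparison, with illustrative variable names:

```python
import math

n, p = 104, 0.001
poisson_mean = n * p  # the Poisson approximation uses mean np

# P(at least one win) = 1 - P(no wins)
poisson_at_least_one = 1.0 - math.exp(-poisson_mean)
binomial_at_least_one = 1.0 - (1.0 - p) ** n

# P(at least two wins) = 1 - P(0 wins) - P(1 win), under the Poisson approximation
poisson_at_least_two = 1.0 - math.exp(-poisson_mean) * (1.0 + poisson_mean)

print(f"Poisson  P(X >= 1) = {poisson_at_least_one:.4f}")
print(f"Binomial P(X >= 1) = {binomial_at_least_one:.4f}")
print(f"Poisson  P(X >= 2) = {poisson_at_least_two:.4f}")
```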

5.63 A test for ESP. In a test for ESP (extrasensory perception), the experimenter looks at cards that are hidden from the subject. Each card contains either a star, a circle, a wave, or a square. As the experimenter looks at each of 20 cards in turn, the subject names the shape on the card.
1. If a subject simply guesses the shape on each card, what is the probability of a successful guess on a single card? Because the cards are independent, the count of successes in 20 cards has a binomial distribution.
2. What is the probability that a subject correctly guesses at least 10 of the 20 shapes?
3. In many repetitions of this experiment with a subject who is guessing, how many cards will the subject guess correctly, on the average? What is the standard deviation of the number of correct guesses?
4. A standard ESP deck actually contains 25 cards. There are 5 different shapes, each of which appears on 5 cards. The subject knows that the deck has this makeup. Is a binomial model still appropriate for the count of correct guesses in one pass through this deck? If so, what are n and p? If not, why not?

5.64 A roulette payoff. A $1 bet on a single number on a casino’s roulette wheel pays $35 if the ball ends up in the number slot you choose. Here is the distribution of the payoff X:

Payoff X	$0	$35
Probability	0.974	0.026

Each spin of the roulette wheel is independent of other spins.

1. What are the mean and standard deviation of X?
2. Sam comes to the casino weekly and bets on 10 spins of the roulette wheel. What does the law of large numbers say about the average payoff Sam receives from his bets each visit?
3. What does the central limit theorem say about the distribution of Sam’s average payoff after betting on 520 spins in a year?
4. Sam comes out ahead for the year if his average payoff is greater than $1 (the amount he bet on each spin). What is the probability that Sam ends the year ahead? The true probability is 0.396. Does using the central limit theorem provide a reasonable approximation?

5.65 A roulette payoff revisited. Refer to the previous exercise. In part (d), the central limit theorem was used to approximate the probability that Sam ends the year ahead. The estimate was about 0.10 too large. Let’s see if we can get closer using the Normal approximation to the binomial with the continuity correction.
1. If Sam plans to bet on 520 roulette spins, he needs to win at least $520 to break even. If each win gives him $35, what is the minimum number of wins m he must have?
2. Given p=1/38≈0.026, what are the mean and standard deviation of X, the number of wins in 520 roulette spins?
3. Use the information in the previous two parts to compute P(X≥m) with the continuity correction. Does your answer get closer to the exact probability 0.396?
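The continuity correction in Exercise 5.65 treats the whole number m as the interval from m-0.5 to m+0.5, so P(X≥m) is approximated by the Normal area to the right of m-0.5. The sketch below assumes p=1/38 exactly rather than the rounded 0.026; the function name is illustrative.

```python
import math
from statistics import NormalDist

def normal_approx_upper_tail(m: int, n: int, p: float, continuity: bool = True) -> float:
    """Normal approximation to P(X >= m) for X ~ B(n, p)."""
    mean = n * p
    sd = math.sqrt(n * p * (1 - p))
    cutoff = m - 0.5 if continuity else m  # continuity correction shifts the cutoff down
    return 1.0 - NormalDist(mean, sd).cdf(cutoff)

n, p = 520, 1 / 38
m = math.ceil(n / 35)  # smallest number of $35 wins that covers $520 in bets

with_cc = normal_approx_upper_tail(m, n, p, continuity=True)
without_cc = normal_approx_upper_tail(m, n, p, continuity=False)

print(f"m = {m}")
print(f"with continuity correction:    {with_cc:.4f}")
print(f"without continuity correction: {without_cc:.4f}")
```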
5.66 Learning a foreign language. Does delaying oral practice hinder learning a foreign language? Researchers randomly assigned 25 beginning students of Russian to begin speaking practice immediately and another 25 to delay speaking for four weeks. At the end of the semester both groups took a standard test of comprehension of spoken Russian. Suppose that in the population of all beginning students, the test scores for early speaking vary according to the N(32, 6) distribution and scores for delayed speaking have the N(29, 5) distribution.
1. What is the sampling distribution of the mean score x̄ in the early-speaking group in many repetitions of the experiment? What is the sampling distribution of the mean score ȳ in the delayed-speaking group?
2. If the experiment were repeated many times, what would be the sampling distribution of the difference ȳ-x̄ between the mean scores in the two groups?
3. What is the probability that the experiment will find (misleadingly) that the mean score for delayed speaking is at least as large as that for early speaking?
5.67 Summer employment of college students. Suppose (as is roughly true) that 88% of college men and 82% of college women were employed last summer. A sample survey interviews SRSs of 400 college men and 400 college women. The two samples are, of course, independent.
1. What is the approximate distribution of the proportion p̂F of women who worked last summer? What is the approximate distribution of the proportion p̂M of men who worked?
2. The survey wants to compare men and women. What is the approximate distribution of the difference in the proportions who worked, p̂M-p̂F? Explain the reasoning behind your answer.
3. What is the probability that in the sample a higher proportion of women than men worked last summer?
5.68 Income of working couples. A study of working couples measures the income X of the husband and the income Y of the wife in a large number of couples in which both partners are employed. Suppose that you knew the means μX and μY and the variances σX² and σY² of both variables in the population.
1. Is it reasonable to take the mean of the total income X+Y to be μX+μY? Explain your answer.
2. Is it reasonable to take the variance of the total income to be σX²+σY²? Explain your answer.
5.69 More on watching live television. Consider the settings of Exercises 5.50 and 5.52.
1. Using the reported 30% from the survey, what is the largest number m out of n=20 undergraduates such that P(X≤m)<0.05? This value m (and anything smaller) represents counts that are very unlikely given p=0.30.
2. Now using the hypothesized rate of 15% and your answer to part (a), what is P(X≤m)? This represents how likely this range of counts is when p=0.15.
3. If you were to increase the sample size from n=20 to n=100 and repeat the calculations of parts (a) and (b), would you expect the probability in part (b) to generally increase or decrease? Explain your answer.
5.70 Iron depletion without anemia and physical performance. Several studies have shown a link between iron depletion without anemia (IDNA) and physical performance. In one study, the physical performance of 24 female collegiate rowers with IDNA was compared with that of 24 female collegiate rowers with normal iron status.²⁷ Several different measures of physical performance were studied, but we’ll focus here on training-session duration. Assume that training-session duration of female rowers with IDNA is Normally distributed, with mean 58 minutes and standard deviation 11 minutes. Training-session duration of female rowers with normal iron status is Normally distributed, with mean 69 minutes and standard deviation 18 minutes.
1. What is the probability that the mean duration of the 24 rowers with IDNA exceeds 63 minutes?
2. What is the probability that the mean duration of the 24 rowers with normal iron status is less than 63 minutes?
3. What is the probability that the mean duration of the 24 rowers with IDNA is greater than the mean duration of the 24 rowers with normal iron status?
5.71 Treatment and control groups. The previous exercise illustrates a common setting for statistical inference. This exercise gives the general form of the sampling distribution needed in this setting. We have a sample of n observations from a treatment group and an independent sample of m observations from a control group. Suppose that the response to the treatment has the N(μX,σX) distribution and that the response of control subjects has the N(μY,σY) distribution. Inference about the difference μY-μX between the population means is based on the difference ȳ-x̄ between the sample means in the two groups.
1. Under the assumptions given, what is the distribution of ȳ? Of x̄?
2. What is the distribution of ȳ-x̄?
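The setting of Exercises 5.70 and 5.71 can be explored numerically. The sketch below assumes independent Normal samples, so that the difference of sample means is Normal with the means subtracted and the variances added; the function name `diff_of_means_dist` is illustrative, and the numbers are those of Exercise 5.70 with x for the IDNA group and y for the normal iron status group.

```python
import math
from statistics import NormalDist

def diff_of_means_dist(mu_x, sigma_x, n, mu_y, sigma_y, m) -> NormalDist:
    """Distribution of (y mean - x mean) for independent Normal samples of sizes m and n."""
    mean = mu_y - mu_x
    sd = math.sqrt(sigma_x**2 / n + sigma_y**2 / m)  # variances add for independent means
    return NormalDist(mean, sd)

# Exercise 5.70: x = rowers with IDNA, y = rowers with normal iron status
diff = diff_of_means_dist(mu_x=58, sigma_x=11, n=24, mu_y=69, sigma_y=18, m=24)

# The IDNA mean exceeds the normal-status mean exactly when the difference is negative
prob_x_greater = diff.cdf(0)
print(f"difference: mean = {diff.mean:.1f}, sd = {diff.stdev:.3f}")
print(f"P(x mean > y mean) = {prob_x_greater:.4f}")
```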

PUTTING IT ALL TOGETHER

5.72 Risks and insurance. The idea of insurance is that we all face risks that are unlikely but carry high cost. Think of a fire destroying your home. So we form a group to share the risk: we all pay a small amount, and the insurance policy pays a large amount to those few of us whose homes burn down. An insurance company looks at the records for millions of homeowners and sees that the mean loss from fire in a year is μ=$600 per house and that the standard deviation of the loss is σ=$12,000. (The distribution of losses is extremely right-skewed: most people have $0 loss, but a few have large losses.) The company plans to sell fire insurance for $500 plus enough to cover its costs and profit.
1. Explain clearly why it would be unwise to sell only 100 policies. Then explain why selling many thousands of such policies is a safe business.
2. Suppose the company sells the policies for $700. If the company sells 50,000 policies, what is the approximate probability that the average loss in a year will be greater than $700?
5.73 Binge drinking. The Centers for Disease Control and Prevention finds that 28% of people aged 18 to 24 years binge drank. Those who binge drank averaged 9.3 drinks per episode and 4.2 episodes per month. The study took a sample of over 18,000 people aged 18 to 24 years, so the population proportion of people who binge drank is very close to p=0.28.²⁸ The administration of your college surveys an SRS of 200 students and finds that 56 binge drink.
1. What is the sample proportion of students at your college who binge drink?
2. If, in fact, the proportion of all students on your campus who binge drink is the same as the national 28%, what is the probability that the proportion in an SRS of 200 students is as large as or larger than the result of the administration’s sample?
3. A writer for the student paper says that the percent of students who binge drink is higher on your campus than nationally. Write a short letter to the editor explaining why the survey does not support this conclusion.
5.74 The ideal number of children. “What do you think is the ideal number of children for a family to have?” A Gallup Poll asked this question of 1020 randomly chosen adults. Roughly 41% thought that a total of three or more children was ideal.²⁹ Suppose that p=0.41 is exactly true for the population of all adults. Gallup announced a margin of error of ±4 percentage points for this poll.
1. What is the probability that the sample proportion p̂ for an SRS of size n=1020 falls between 0.37 and 0.45? You see that it is likely, but not certain, that polls like this give results that are correct within their margin of error.
2. What is the probability that a sample proportion p̂ falls between 0.37 and 0.45 (that is, within ±4 percentage points of the true p) if the sample is an SRS of size n=300? Of size n=5000?
3. Combine these results to make a general statement about the effect of larger samples in a sample survey.
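The effect of sample size in Exercise 5.74 can be tabulated with a short sketch. It assumes the Normal approximation for the sample proportion, with mean p and standard deviation √(p(1-p)/n); the helper name is illustrative.

```python
import math
from statistics import NormalDist

def prob_within_margin(p: float, n: int, margin: float) -> float:
    """Normal approximation to P(p - margin < sample proportion < p + margin)."""
    sd = math.sqrt(p * (1 - p) / n)
    dist = NormalDist(p, sd)
    return dist.cdf(p + margin) - dist.cdf(p - margin)

# Sample sizes from parts 1 and 2, with p = 0.41 and a margin of 4 percentage points
results = {n: prob_within_margin(0.41, n, 0.04) for n in (300, 1020, 5000)}

for n, prob in results.items():
    print(f"n = {n:>4}: P(0.37 < sample proportion < 0.45) = {prob:.4f}")
```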
5.75 Is the ESP result better than guessing? When the ESP study of Exercise 5.63 discovers a subject whose performance appears to be better than guessing, the study continues at greater length. The experimenter looks at many cards bearing one of five shapes (star, square, circle, wave, and cross) in an order determined by random numbers. The subject cannot see the experimenter as the experimenter looks at each card in turn, in order to avoid any possible nonverbal clues. The answers of a subject who does not have ESP should be independent observations, each with probability 1/5 of success. We record 900 attempts.
1. What are the mean and the standard deviation of the count of successes?
2. What are the mean and the standard deviation of the proportion of successes among the 900 attempts?
3. What is the probability that a subject without ESP will be successful in at least 24% of 900 attempts?
4. The researcher considers evidence of ESP to be a proportion of successes so large that there is only probability 0.01 that a subject could do this well or better by guessing. What proportion of successes must a subject have to meet this standard? (Example 1.45, on page 62, shows how to do an inverse calculation for the Normal distribution that is similar to the type required here.)
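The quantities in Exercise 5.75, including the inverse Normal calculation in the last part, might be computed as in this sketch. It assumes the Normal approximation for the proportion of successes and uses `NormalDist.inv_cdf` from Python's standard library for the inverse step; variable names are illustrative.

```python
import math
from statistics import NormalDist

n, p = 900, 1 / 5  # 900 attempts, probability 1/5 of success when guessing

# Count of successes
count_mean = n * p
count_sd = math.sqrt(n * p * (1 - p))

# Proportion of successes
prop_sd = math.sqrt(p * (1 - p) / n)
prop_dist = NormalDist(p, prop_sd)

# Forward calculation: chance of at least 24% successes by guessing
prob_at_least_24 = 1.0 - prop_dist.cdf(0.24)

# Inverse calculation: proportion exceeded with probability only 0.01
cutoff = prop_dist.inv_cdf(0.99)

print(f"count: mean = {count_mean:.0f}, sd = {count_sd:.0f}")
print(f"proportion: mean = {p:.2f}, sd = {prop_sd:.4f}")
print(f"P(proportion >= 0.24) = {prob_at_least_24:.4f}")
print(f"cutoff for upper 1% = {cutoff:.4f}")
```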
5.76 How large a sample is needed? The changing probabilities you found in Exercise 5.74 are due to the fact that the standard deviation of the sample proportion p̂ gets smaller as the sample size n increases. If the population proportion is p=0.41, how large a sample is needed to reduce the standard deviation of p̂ to σp̂=0.005? (The 68–95–99.7 rule then says that about 95% of all samples will have p̂ within 0.01 of the true p.)
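The sample-size question in Exercise 5.76 amounts to solving √(p(1-p)/n) = σ for n, which gives n = p(1-p)/σ². A minimal sketch, with an illustrative function name; in practice the result would be rounded up to a whole number of respondents.

```python
import math

def sample_size_for_sd(p: float, target_sd: float) -> float:
    """Solve sqrt(p * (1 - p) / n) = target_sd for n (returned unrounded)."""
    return p * (1 - p) / target_sd**2

required_n = sample_size_for_sd(p=0.41, target_sd=0.005)
print(f"required n is about {required_n:.0f}")

# Check: plugging the result back in recovers the target standard deviation
check_sd = math.sqrt(0.41 * 0.59 / required_n)
print(f"sd of the sample proportion at that n = {check_sd:.4f}")
```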