7.3 Sample Size Calculations

In this section, we focus on a very important issue when planning a study: choosing the sample size. A wise user of statistics does not plan for inference without at the same time planning data collection. We describe sample size procedures for both confidence intervals and significance tests. While the actual formulas are a bit technical, only a general understanding of the process is necessary. We can rely on software to do the heavy lifting.

Sample size for confidence intervals

We can arrange to have both high confidence and a small margin of error by choosing an appropriate sample size. Let’s first focus on the one-sample t confidence interval. Its margin of error is

m = t* SE_x̄ = t* s/√n

In addition to the confidence level C, which determines t*, and sample size n, this margin of error depends on the sample standard deviation s. Because we don’t know the value of s until we collect the data, we must guess a value to use in our calculations. Because s is our estimate of the population standard deviation σ, this value can also be considered our guess of the population standard deviation.

We’ll call this guessed value sg. To help with this guess, we typically use results from a pilot study or from previously published studies. If no results are available, we then rely on subject-matter expertise to inform our guess. For example, if the expected range of the data were known, we could use

sg = range/4

following the 68–95–99.7 rule. It is always better to use a value of the standard deviation that is a little larger than what is expected. This may result in a sample size that is a little larger than needed, but it helps avoid the situation where the resulting margin of error is larger than desired.

Given the desired margin of error m and a guess for s, we can find the sample size by plugging these values into the margin of error formula and solving for n. In fact, we did this in Chapter 6 (page 340). The one complication here is that t* depends not only on the confidence level C but also on the sample size n. Here are the details.

We want the smallest sample size n for which the expected margin of error t* sg/√n is no larger than m. Finding it can be done using the following iterative search:

  1. Get an initial sample size by replacing t* with z*. Compute n = (z* sg/m)² and round up to the nearest integer.
  2. Use this sample size to obtain t* and check whether t* sg/√n ≤ m.
  3. If the requirement is satisfied, then this n is the needed sample size. If the requirement is not satisfied, increase n by 1 and return to Step 2.
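The search above can be sketched in a few lines of Python, using scipy for the Normal and t critical values (the function name and the demonstration values below are ours, not from the text):

```python
from math import ceil, sqrt

from scipy.stats import norm, t


def sample_size_for_moe(m, s_guess, conf=0.95):
    """Smallest n whose expected margin of error t* * s_guess / sqrt(n) is <= m."""
    p = 1 - (1 - conf) / 2                          # upper-tail area for the critical value
    n = ceil((norm.ppf(p) * s_guess / m) ** 2)      # Step 1: initial n from z*
    while t.ppf(p, n - 1) * s_guess / sqrt(n) > m:  # Step 2: check the requirement with t*
        n += 1                                      # Step 3: bump n and recheck
    return n
```

For instance, a desired margin of error of 2 with a guessed standard deviation of 10 at 95% confidence gives n = 99 here, a few more than the z*-based starting value of 97.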

Notice that this method makes no reference to the size of the population. It is the size of the sample that determines the margin of error. The size of the population does not influence the sample size we need as long as the population is much larger than the sample.

Example 7.23 Planning a study of a robot delivery service.

In Example 7.1 (page 387), we calculated a 95% confidence interval for the average delivery time, in minutes, for a delivery robot service to bring you lunch ordered from Cosi. The margin of error based on an SRS of n=8 orders was 12.0 minutes. Suppose that a new study is being planned and the goal is to have a margin of error of 5 minutes. How many deliveries need to be sampled?

The sample standard deviation in Example 7.1 is s=14.317 minutes. To be somewhat conservative, we’ll round up and use sg = 15.0 minutes as our guess of the population standard deviation, then use the iterative search method to find n.

  1. For an initial n, we replace t* with z*. This results in

    n = (z* sg/m)² = [1.96(15.0)/5]² = 34.57

    Round up to get n=35.

  2. We now check to see if this sample size satisfies the requirement when we switch back to t*. For n = 35, we have n − 1 = 34 degrees of freedom and t* = 2.032. Using this value, the expected margin of error is

    2.032(15.0)/√35 = 5.152

    This is larger than m=5, so the requirement is not satisfied.

  3. The following table summarizes these calculations for some larger values of n.

    n    t* sg/√n
    36    5.075
    37    5.001
    38    4.930

The requirement is first satisfied when n=38. Thus, we need to sample at least n=38 deliveries for the expected margin of error to be no more than 5 minutes.

Figure 7.20 shows the Minitab input window used to do these calculations. Because the default confidence level is 95%, only the desired margin of error m and the estimate for s need to be entered. The software does the rest once you click OK.


Figure 7.20 Minitab input window used to compute the sample size for a desired margin of error, Example 7.23.

Note that the n=38 refers to the expected margin of error being no more than 5 minutes. This does not guarantee that the margin of error for the collected sample will be less than 5 minutes. That is because the sample standard deviation s varies from sample to sample, and these calculations treat it as a fixed quantity. If you want stronger control of the margin of error, more advanced sample size procedures ask you to also specify the probability of obtaining a margin of error less than the desired value. For the current approach, this probability is roughly 50%. For a probability closer to 100%, the sample size will need to be larger. For example, if we wanted this probability to be roughly 80%, we’d perform these calculations in SAS using the commands

proc power;
  onesamplemeans CI=t stddev=15.0 halfwidth=5 probwidth=0.80 ntotal=.;
run;

The needed sample size increases from n=38 to n=44.
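This stronger criterion can also be sketched directly: the realized margin t* s/√n is at most m exactly when s ≤ m√n/t*, and (n − 1)s²/σ² has a chi-square distribution with n − 1 degrees of freedom. A rough Python version under those assumptions (our function, not SAS's internal algorithm, though it agrees with the n = 44 quoted above for this example):

```python
from math import sqrt

from scipy.stats import chi2, t


def sample_size_prob_width(m, sigma_guess, conf=0.95, prob=0.80):
    """Smallest n with P(margin of error <= m) >= prob, treating
    (n - 1) * s**2 / sigma**2 as chi-square with n - 1 df."""
    p = 1 - (1 - conf) / 2
    n = 2
    while True:
        t_star = t.ppf(p, n - 1)
        # margin <= m  iff  s <= m*sqrt(n)/t_star; rescale to the chi-square variable
        cutoff = (n - 1) * (m * sqrt(n) / (t_star * sigma_guess)) ** 2
        if chi2.cdf(cutoff, n - 1) >= prob:
            return n
        n += 1
```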

Unfortunately, the actual number of usable observations is often less than that planned at the beginning of a study. This is particularly true of data collected in surveys or studies that involve a time commitment from the participants. Careful study designers often assume a nonresponse rate or dropout rate that specifies what proportion of the originally planned sample will fail to provide data. We use this information to calculate the sample size to be used at the start of the study. For example, if a survey were planned needing n=200 respondents and only 40% of those surveyed are expected to respond, we would need to start with a sample size of 200/0.40=500 to obtain usable information from 200.
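The inflation step is simple enough to capture in a one-line helper (ours, for illustration): divide the needed number of respondents by the anticipated response rate and round up.

```python
from math import ceil


def starting_sample_size(n_needed, response_rate):
    """Number to recruit so that roughly n_needed usable responses remain."""
    return ceil(n_needed / response_rate)


print(starting_sample_size(200, 0.40))  # 500, as in the survey example above
```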

These sample size calculations also do not account for collection costs. In practice, taking observations costs time and money. There are times when the required sample size may be impossibly expensive. In those situations, one might consider a larger margin of error and/or a lower confidence level to be acceptable.

Check-in
  1. 7.21 How do design choices change the sample size? Refer to Example 7.23. For each of the following changes, state whether the needed sample size will increase or decrease and explain your reasoning.

    1. The desired margin of error is 7.5 minutes rather than 5 minutes.

    2. A 90% rather than 95% confidence level is used.

    3. We use sg=20 minutes instead of 15 minutes.

  2. 7.22 How many postings to sample? For Check-in question 7.1 (page 386), a random sample of n=16 postings from Zillow.com resulted in a standard deviation of $276. If a new study being planned has sg=280 and the desired 95% margin of error is $100, how many postings need to be sampled?

For the two-sample t confidence interval, the margin of error is

m = t* √(s1²/n1 + s2²/n2)

A similar type of iterative search can be used to determine the sample sizes n1 and n2, but now we need to guess both standard deviations and decide on an approximation method for the degrees of freedom.

An alternative approach is to assume that the standard deviations and sample sizes are the same, so the margin of error is

m = t* sp √(2/n)

and the degrees of freedom are 2(n − 1). This is the approach most statistical software takes, as the problem becomes a variation of the one-sample case.
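Under these equal-n, equal-standard-deviation assumptions, the one-sample search carries over almost unchanged; here is a Python sketch (function name and demonstration values are ours) using 2(n − 1) degrees of freedom:

```python
from math import ceil, sqrt

from scipy.stats import norm, t


def sample_size_two_sample(m, s_guess, conf=0.95):
    """Smallest per-group n with expected margin t* * s_guess * sqrt(2/n) <= m."""
    p = 1 - (1 - conf) / 2
    n = ceil(2 * (norm.ppf(p) * s_guess / m) ** 2)   # initial per-group n from z*
    while t.ppf(p, 2 * (n - 1)) * s_guess * sqrt(2 / n) > m:
        n += 1
    return n
```

For example, a desired 90% margin of error of 3 with a guessed common standard deviation of 8 gives n = 40 per group under this approach.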

Example 7.24 Planning a new blood pressure study.

In Example 7.22 (page 426), we calculated a 90% confidence interval for the mean difference in blood pressure. The 90% margin of error was roughly 5.6 mm Hg, which was relatively large. Suppose that a new study is being planned and the desired margin of error at 90% confidence is 2.8 mm Hg. How many subjects per group do we need?

The pooled sample standard deviation in Example 7.22 is 7.385. To be a bit conservative, we’ll guess that the two population standard deviations are both 8.0. We now implement the same iterative search method using the two-sample margin of error formula.

  1. For an initial n, we replace t* with z*. This results in

    n = 2(z* s*/m)² = 2[(1.645)(8)/2.8]² = 44.2

    We round up to get n=45.

  2. The following table summarizes the margin of error for this and some larger values of n:

    n    t* s* √(2/n)
    45    2.834
    46    2.801
    47    2.770

The expected margin of error first falls below 2.8 mm Hg at n=47. Thus, we need at least 47 subjects per group. In SAS, we’d perform these calculations using the commands

proc power;
twosamplemeans CI=diff alpha=0.1 stddev=8 halfwidth=2.8
               probwidth=0.50 npergroup=.;
run;

This sample size is roughly 4.5 times the sample size used in Example 7.22. The researcher may not be able to recruit a sample this large. We should therefore consider alternatives, such as a larger desired margin of error.

Check-in
  1. 7.23 Would we need a larger sample size? Refer to Example 7.24. For each of the following changes, state whether the needed sample size will increase or decrease and explain your reasoning.

    1. The desired margin of error is 3 mm Hg instead of 2.8 mm Hg.

    2. A 95% rather than 90% confidence level is used.

    3. We use n − 1 instead of 2(n − 1) for the degrees of freedom.

  2. 7.24 Planning a new calcium study. Refer to Example 7.24. What is the required sample size if the goal is to have the 95% margin of error no more than 5 mm Hg? Use sg=8.0 and 2(n − 1) for the degrees of freedom.

Power of a significance test

The power of a statistical test measures its ability to detect deviations from the null hypothesis. In practice, we carry out a significance test in the hope of showing that the null hypothesis is false. The higher the power, the more likely this will occur. Unfortunately, because of inadequate planning, researchers frequently fail to find evidence for the effects that they believe to be present. This is often the result of an inadequate sample size. Power calculations performed prior to running the experiment help avoid this occurrence and ensure that the sample size is sufficiently large to answer the research question.

Just like the margin of error, the power of a significance test depends on various study-specific factors. The factors and how they impact power are as follows:

  • The significance level α: a larger α makes it easier to reject H0, so the power increases.
  • The sample size(s): larger samples give more precise estimates, so the power increases.
  • The population standard deviation(s): less variability means more precision, so the power increases.
  • The specific alternative value of the parameter: values farther from H0 are easier to detect, so the power increases.

The margin of error depends on the first three factors. Any change in these factors that increases the power of a significance test also results in a smaller margin of error. This is due to the close relationship between significance tests and confidence intervals.

As for the fourth factor, we usually rely on subject-specific knowledge to choose this value. It is typically the smallest departure from H0 that matters, scientifically or in terms of decision making. For example, if an owner of a store is considering allowing credit card purchases and this is only profitable if the average monthly sales increase by 2%, the owner would not be interested in detecting an increase of just 1%. The owner would instead determine a sample size to detect increases above 2% with high probability.

There are times, however, when the choice of the alternative is not clear-cut. In these situations, we commonly consider the effect size, which is the departure from H0 divided by the population standard deviation. This standardizes the departure and puts it on a common scale. Conventional guidelines consider an effect size of 0.2 to be small, 0.5 to be medium, and 0.8 to be large.42

Figure 7.21 visually displays the two steps of this calculation for a one- or two-sample t test under the greater-than alternative. First, the distribution of the test statistic under H0 is used to determine the t* values that lead to rejection. Then the distribution of the test statistic under Ha is used to compute the power. This latter calculation requires a new distribution, the noncentral t distribution. It is not practical to do calculations by hand using this distribution, so we rely on software to do the calculations for us.


Figure 7.21 Visual for computing power under the greater-than alternative. The sampling distributions of the test statistic under both H0 and Ha are used.

We just need to provide the four study-specific factors and let the software do the technical calculations. When working in terms of an effect size with software that does not accept one directly, set the alternative equal to the effect size and set the population standard deviation equal to 1.

We’ll now run through both one- and two-sample examples.

Example 7.25 Is the sample size large enough?

Recall Example 7.2 (page 389) on the average delivery time from Cosi to your dormitory by a robot delivery service. Your roommate eats lunch at a different time and wants to perform a similar study. She also wants to test if the average delivery time is larger than 15 minutes but additionally wants to be very certain her study will reject H0 if the average is 30 minutes or larger. Does a new study using n=8 deliveries satisfy this requirement?

To answer this, we compute the power of the one-sample t test at the 5% significance level for

H0: μ = 15.0    Ha: μ > 15.0

against the alternative that μ=30 when n=8. The previous sentence provides most of the information we need to compute the power. The remaining factor is a guess of σ. Here, we can use the standard deviation from the previous study. Similar to Example 7.23 (page 434), we will round up and use sg=15 minutes.

Figure 7.22 shows Minitab output for this power calculation. The power when μ=30 is 81.5% and is represented by a dot on the power curve at a difference of means d = 30 − 15 = 15. This curve is very informative. We see that with a sample size of 8, the power is greater than 90% for differences larger than approximately 17.5 minutes (μ ≥ 32.5 minutes). Your roommate will want to increase the sample size if “very certain” implies a 0.90 or 0.95 probability.


Figure 7.22 Minitab output (a power curve) for the one-sample calculation, Example 7.25.
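The computation behind this curve can be sketched with scipy's noncentral t distribution: the noncentrality parameter is the alternative's shift measured in standard-error units, and the power is the chance that the test statistic exceeds the rejection cutoff. The function below is our sketch of the software's calculation, not Minitab's code:

```python
from math import sqrt

from scipy.stats import nct, t


def power_one_sample_t(mu0, mu_alt, sigma_guess, n, alpha=0.05):
    """Power of the one-sided (greater-than) one-sample t test."""
    df = n - 1
    t_crit = t.ppf(1 - alpha, df)                     # rejection cutoff under H0
    delta = (mu_alt - mu0) / (sigma_guess / sqrt(n))  # noncentrality parameter
    return nct.sf(t_crit, df, delta)                  # P(reject H0) under Ha


print(power_one_sample_t(15, 30, 15, 8))  # ≈ 0.815, agreeing with Figure 7.22
```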

Check-in
  1. 7.25 Power for other values of μ. Using the curve in Figure 7.22, what is the approximate power for the alternative

    1. μ=20?

    2. μ=35?

    3. μ=15?

  2. 7.26 How is power affected? Refer to Example 7.25.

    1. If your roommate used sg=20 instead of 15, would the power at μ=30 decrease, increase, or stay the same? Explain your answer.

    2. If your roommate used α=0.01 instead of α=0.05, would the power at μ=30 decrease, increase, or stay the same? Explain your answer.

For the two-sample power calculation, we consider only the common case where the null hypothesis is “no difference,” μ1 − μ2 = 0. We also consider the calculation for the pooled two-sample t test. A simple modification of what is entered is needed when we do not pool.

Example 7.26 Planning a new study of calcium versus placebo groups.

In Example 7.19 (page 421), we examined the effect of calcium on blood pressure by comparing the means of a treatment group and a placebo group using a pooled two-sample t test. The P-value was 0.059, failing to achieve the usual standard of 0.05 for statistical significance. Suppose that we wanted to plan a new study that would provide convincing evidence—say, at the 0.01 level—with high probability. Let’s examine a study design with 45 subjects in each group (n1=n2=45) to see if this meets our goals when the difference in means μ1 − μ2 = 5. For our guess of σ, we use 7.4, our pooled estimate from Example 7.21.

Figure 7.23 shows the JMP power calculator for the two-sample t test. You input values for α, σ, n1 + n2, and μ1 − μ2, and it computes the power. The JMP calculator only considers the two-sided alternative, so to get the power for a one-sided alternative, the significance level must be input as 2α. Most other software, such as Minitab, provides the option to choose the alternative.


Figure 7.23 JMP input/output window for the two-sample power calculation, Example 7.26.


The power window in Figure 7.23 shows that the power under these settings is 79.7%. If we judge this probability to be high enough, we can proceed with recruitment. However, given that there is a chance some subjects will drop out or not follow the protocol, we may want to recruit a larger sample, such as n=50 per group.

These two examples focused on determining the power for a given sample size. Calculators such as the one in Figure 7.23 can also be used to determine the sample size needed for a selected level of power and alternative, or the alternative that can be detected with a given power and sample size. This makes these calculators very flexible and informative.
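For the pooled two-sample case in Example 7.26, the same noncentral t machinery applies with 2(n − 1) degrees of freedom and noncentrality (μ1 − μ2)/(σ√(2/n)). A sketch (our function, written for a one-sided test):

```python
from math import sqrt

from scipy.stats import nct, t


def power_two_sample_t(diff, sigma_guess, n_per_group, alpha=0.01):
    """Power of the one-sided pooled two-sample t test of 'no difference'."""
    df = 2 * (n_per_group - 1)
    t_crit = t.ppf(1 - alpha, df)                         # rejection cutoff under H0
    delta = diff / (sigma_guess * sqrt(2 / n_per_group))  # noncentrality parameter
    return nct.sf(t_crit, df, delta)
```

With diff = 5, a guessed σ of 7.4, and 45 subjects per group, this comes out close to the 79.7% that JMP reports (JMP's two-sided input at 2α adds only a negligible lower-tail term).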

Check-in
  1. 7.27 Power and the choice of alternative. If you were to repeat the calculation in Example 7.26 for the two-sided alternative, would the power increase or decrease? Explain your answer.

  2. 7.28 Power and the standard deviation. If the true population standard deviation were 8 instead of the 7.4 hypothesized in Example 7.26, would the power increase or decrease? Explain.

Section 7.3 SUMMARY

  • The sample size required for a confidence interval for a population mean μ to have an expected margin of error no larger than m satisfies the constraint

    t* s*/√n ≤ m

    where t* is the critical value for the desired level of confidence with n − 1 degrees of freedom and s* is the guessed value for the population standard deviation. The smallest such n can be determined by hand or by software.

  • For a two-sample confidence interval of the difference in means, the necessary sample sizes can be obtained using a similar constraint. Most often it is assumed that the standard deviations and sample sizes are the same.

  • The power of the one- and two-sample t tests involves the noncentral t distributions. Calculations involving these distributions are not practical by hand but are easy with software.

  • To compute the power, the researcher must provide the significance level α, sample size(s), guesses of the standard deviation(s), and specific alternative.

Section 7.3 EXERCISES

  1. 7.66 What’s wrong? For each of the following statements, explain what is wrong and why.

    1. Doubling the sample size means the margin of error for a population mean μ is exactly cut in half.

    2. When testing H0:μ=50 versus the greater-than alternative, the power at μ=53 is larger than at μ=56.

    3. Increasing sample size increases the power and decreases the probability of a Type I error.

    4. When testing H0:μ=25 versus the two-sided alternative, the power at μ=21 is larger than at μ=29.

  2. 7.67 Starting salaries. In a recent survey by the National Association of Colleges and Employers, the average starting salary for college graduates with a computer and information sciences degree was reported to be $81,292.43 You are planning to do a survey of starting salaries for recent computer science majors from your university.

    1. Using an estimated standard deviation of $12,100, what sample size do you need to have a margin of error equal to $5000 with 95% confidence?

    2. Suppose that, in the setting of part (a), you have the resources to contact 30 recent graduates. If all respond, will your margin of error be larger or smaller than $5000? What if only 80% respond? Verify your answers by performing the calculations.

  3. 7.68 Apartment rental rates. You hope to rent an unfurnished one-bedroom apartment in Washington, DC, next year. You call a friend who lives there and ask him to give you an estimate of the mean monthly rate. Having taken a statistics course recently, the friend asks about your desired margin of error and confidence level for this estimate. He also tells you that the standard deviation of monthly rents for one-bedroom apartments is about $640.

    1. For 95% confidence and a margin of error of $200, how many apartments should your friend randomly sample from Realtor.com?

    2. Suppose that you want the margin of error to be no more than $100. How many apartments should your friend sample?

    3. Why is the sample size in part (b) not just four times larger than the sample size in part (a)?

  4. 7.69 More on apartment rental rates. Refer to the previous exercise. Will the 95% confidence interval include approximately 95% of the rents of all unfurnished one-bedroom apartments in this area? Explain why or why not.

  5. 7.70 Accuracy of a laboratory scale. To assess the accuracy of a laboratory scale, a standard weight known to weigh 10 grams is weighed repeatedly. The scale readings are Normally distributed, with unknown mean. (The mean is 10 grams if the scale has no bias.) The standard deviation of the scale readings in the past has been 0.0013 gram.

    1. The weight is measured five times. The mean result is 10.0009 grams. Give a 98% confidence interval for the mean of repeated measurements of the weight.

    2. How many measurements must be averaged to get an expected margin of error no more than 0.001 with 98% confidence?

  6. 7.71 Accuracy of a laboratory scale, continued. Refer to the previous exercise. Suppose that instead of a confidence interval, the researchers want to perform a test (with α=0.05) that the scale is unbiased (μ=10).

    1. What sample size n is necessary to have at least 90% power when the alternative mean is μ=10.001?

    2. Suppose the researchers can only perform a maximum of n=10 measurements. Based on your answer in part (a), will the power be more or less than 90%? Explain your answer.

    3. Verify your answer in part (b) by computing the power when n=10.

  7. 7.72 Sample size calculations. You are designing a study to test the null hypothesis that μ=50 versus the alternative that μ>50. Assume that σ is 25. Suppose that it would be important to be able to detect the alternative μ=54. What sample size is needed to detect this alternative with power of at least 0.80?

  8. 7.73 Power of the comparison of DXA machine operators. Suppose that the bone researchers in Exercise 7.29 (page 409) want to be able to detect an alternative mean difference of 0.002. Find the power for this alternative for a sample size of 20 patients. Make sure to explain the reasoning for your choice of standard deviation in these calculations.

  9. 7.74 Determining the sample size. In Example 7.25 (page 439), we determined the power of detecting μ=30 minutes when n=8. Suppose your roommate wants the power to be at least 90% when μ=30. What is the minimum sample size needed for this desired power?

  10. 7.75 Changing the significance level. In Example 7.26 (page 440), we assessed the power of a new study of calcium on blood pressure, assuming n1=n2=45 subjects. The power was based on α=0.01. Suppose that we wanted to use α=0.05 instead.

    1. Would the power increase or decrease? Explain your answer in terms someone unfamiliar with power calculations can understand.

    2. Verify your answer by computing the power.

  11. 7.76 Planning a study to compare tree size. In Exercise 7.57 (page 432), DBH data for longleaf pine trees in two parts of the Wade Tract are compared. Suppose that you are planning a similar study in which you will measure the diameters of longleaf pine trees. Based on Exercise 7.57, you are willing to assume that the standard deviation for both halves is 20 cm. Suppose that a difference in mean DBH of 10 cm or more would be important to detect. You will use a t statistic and a two-sided alternative for the comparison.

    1. Find the power if you randomly sample 20 trees from each area to be compared.

    2. Repeat the calculations for 60 trees in each sample.

    3. If you had to choose between the 20 and 60 trees per sample, which would you choose? Give reasons for your answer.

  12. 7.77 More on planning a study to compare tree size. Refer to the previous exercise. Find the two standard deviations from Exercise 7.57. Do the same for the data in Exercise 7.58, which is a similar setting. These are somewhat smaller than the assumed value that you used in the previous exercise. Explain why it is generally a better idea to assume a standard deviation that is larger than you expect than one that is smaller. Repeat the power calculations for some other reasonable values of σ and comment on the impact of the size of σ for planning the new study.

  13. 7.78 Planning a study to compare ad placement.  Refer to Exercise 7.56 (page 431), where we compared trustworthiness ratings for ads from two different publications. Suppose that you are planning a similar study using two different publications that are not expected to show the differences seen when comparing the Wall Street Journal with the National Enquirer. You would like to detect a difference of 1.5 points using a two-sided significance test with a 5% level of significance. Based on Exercise 7.56, it is reasonable to use 1.6 as the value of the common standard deviation for planning purposes.

    1. What is the power if you use sample sizes similar to those used in the previous study—for example, 65 for each publication?

    2. Repeat the calculations for 100 in each group.

    3. What sample size would you recommend for the new study?