7.2 Comparing Two Means

A psychologist wants to compare Wisconsin and Indiana college students’ impressions of personality based on selected Facebook pages. A nutritionist is interested in the effect of increased calcium on blood pressure. A bank wants to know which of two incentive plans will most increase the use of its debit card. Two-sample problems such as these are among the most common situations encountered in statistical practice.

A two-sample problem can arise from a randomized comparative experiment that randomly divides the subjects into two groups and exposes each group to a different treatment. A two-sample problem can also arise when comparing random samples separately selected from two populations. Unlike in the matched pairs designs studied earlier, there is no matching of the units in the two samples. In fact, the two samples may be of different sizes. As a result of these differences, inference procedures for two-sample data differ from those for matched pairs.

We can present two-sample data graphically by using a back-to-back stemplot for small samples (page 13) or with side-by-side boxplots for larger samples (page 34). Now we will apply the ideas of formal inference in this setting. When both population distributions are symmetric, and especially when they are at least approximately Normal, a comparison of the mean responses in the two populations is most often the goal of inference.

We have two independent samples, from two distinct populations (such as subjects given the latest Apple iPhone and those given the latest Samsung Galaxy smartphone). The same response variable—say, battery life—is measured for both samples. We will call the variable x1 in the first population and x2 in the second because the variable may have different distributions in the two populations. Here is the notation that we will use to describe the two populations:

Population Variable Mean Standard deviation
1 x1 μ1 σ1
2 x2 μ2 σ2

We want to compare the two population means, either by giving a confidence interval for μ1 − μ2 or by testing the hypothesis of no difference, H0: μ1 = μ2.

Inference is based on two independent SRSs, one from each population. Here is the notation that describes the samples:

Population Sample size Sample Mean Sample standard deviation
1 n1 x¯1 s1
2 n2 x¯2 s2

Throughout this section, the subscripts 1 and 2 show the population to which a parameter or a sample statistic refers.

The two-sample z statistic

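The natural estimator of the difference μ1 − μ2 is the difference between the sample means, x¯1 − x¯2. If we are to base inference on this statistic, we must know its sampling distribution. Here are some facts from our study of probability:

  • The mean of the difference x¯1 − x¯2 is the difference between the means, μ1 − μ2.

  • Because the two samples are independent, the variance of the difference is the sum of the variances of x¯1 and x¯2, which is σ1²/n1 + σ2²/n2.

  • If the two population distributions are both Normal, then the distribution of x¯1 − x¯2 is also Normal.

We now know the sampling distribution of x¯1 − x¯2 when both populations are Normally distributed. The mean and variance of this Normal distribution can be expressed in terms of the parameters of the two Normal populations.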

Here’s an example of a Normal distribution probability calculation using this sampling distribution.

Example 7.13 Heights of 10-year-old girls and boys.

A fourth-grade class has 12 girls and 8 boys. The children’s heights are recorded on their 10th birthdays. What is the chance that the girls are taller than the boys? Of course, it is very unlikely that all the girls are taller than all the boys. We translate the question into the following: What is the probability that the mean height of the girls in this class is greater than the mean height of the boys in this class?

Based on information from the National Health and Nutrition Examination Survey, we assume that the heights (in inches) of 10-year-old girls are N(56.9, 2.8) and the heights of 10-year-old boys are N(56.0, 3.5).23 The heights of the students in our class are assumed to be random samples from these populations. The two distributions are shown in Figure 7.13(a).

Distribution curves of height and height difference.

Figure 7.13 Distributions, Example 7.13. (a) Distributions of heights of 10-year-old boys and girls. (b) Distribution of the difference between mean heights of 12 girls and 8 boys.

The difference x¯1 − x¯2 between the female and male mean heights varies in different random samples. The sampling distribution has mean

$$\mu_1 - \mu_2 = 56.9 - 56.0 = 0.9 \text{ inch}$$

and variance

$$\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2} = \frac{2.8^2}{12} + \frac{3.5^2}{8} = 2.18$$

The standard deviation of the difference in sample means is, therefore, √2.18 = 1.48 inches.

If the heights vary Normally, the difference in sample means is also Normally distributed. The distribution of the difference in heights is shown in Figure 7.13(b). We standardize x¯1 − x¯2 by subtracting its mean (0.9) and dividing by its standard deviation (1.48). Therefore, the probability that the girls in the class, on average, are taller than the boys in the class is

$$P(\bar{x}_1 - \bar{x}_2 > 0) = P\left(\frac{(\bar{x}_1 - \bar{x}_2) - 0.9}{1.48} > \frac{0 - 0.9}{1.48}\right) = P(Z > -0.61) = 0.7291$$

Even though the population mean height of 10-year-old girls is greater than the population mean height of 10-year-old boys, the probability that the sample mean of the girls is greater than the sample mean of the boys in our class is only 73%. Large samples are needed to see the effects of small differences.
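
The calculation in Example 7.13 can be checked numerically. The following is a minimal sketch that uses only the Python standard library; the variable names are illustrative choices, not part of the example. Because it does not round the standard deviation to 1.48 or z to −0.61, its result differs from 0.7291 in the fourth decimal place.

```python
from math import sqrt
from statistics import NormalDist

# Population parameters assumed in Example 7.13 (heights in inches).
mu_girls, sigma_girls, n_girls = 56.9, 2.8, 12
mu_boys, sigma_boys, n_boys = 56.0, 3.5, 8

# Mean and standard deviation of the difference in sample means.
mean_diff = mu_girls - mu_boys
sd_diff = sqrt(sigma_girls**2 / n_girls + sigma_boys**2 / n_boys)

# P(xbar_girls - xbar_boys > 0) under the Normal sampling distribution.
prob_girls_taller = 1 - NormalDist(mean_diff, sd_diff).cdf(0)

print(f"mean = {mean_diff:.2f}, sd = {sd_diff:.2f}, P = {prob_girls_taller:.4f}")
```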

As Example 7.13 reminds us, any Normal random variable has the N(0, 1) distribution when standardized. We have arrived at a new z statistic, the two-sample z statistic:

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

In the unlikely event that both population standard deviations are known, the two-sample z statistic is the basis for inference about μ1 − μ2. Exact z procedures are seldom used, however, because σ1 and σ2 are rarely known. In Chapter 6, we discussed the one-sample z procedures in order to introduce the ideas of inference. Here we move directly to the more useful t procedures.

The two-sample t procedures

Suppose now that the population standard deviations σ1 and σ2 are not known. We estimate them by the sample standard deviations s1 and s2 from our two samples. Following the pattern of the one-sample case, we substitute the standard errors for the standard deviations used in the two-sample z statistic. The result is the two-sample t statistic:

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

Unfortunately, this statistic does not have a t distribution. A t distribution replaces the N(0, 1) distribution only when a single standard deviation is replaced by its estimate. In this case, we replace two standard deviations (σ1 and σ2) by their estimates (s1 and s2).

Nonetheless, we can approximate the distribution of the two-sample t statistic by using the t(k) distribution with an approximation for the degrees of freedom k, also known as a df approximation. We use these approximations to find approximate values of t* for confidence intervals and to find approximate P-values for significance tests. The choice of approximation rarely makes a difference in the conclusion.

Most statistical software uses the Satterthwaite approximation to approximate the t(k) distribution unless the user requests another method. Use of this approximation without software is a bit complicated.24 In general, the resulting k will not be an integer.

If you cannot access software, we recommend using degrees of freedom k equal to the smaller of n1 − 1 and n2 − 1. This approximation is appealing because it is conservative.25 That is, margins of error for confidence intervals are larger than they need to be, so the true confidence level is larger than C. Likewise, P-values for significance tests will be larger, making it more difficult to reject H0.
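
To make the two choices of degrees of freedom concrete, here is a small sketch that computes both. It assumes the standard Welch-Satterthwaite formula, k = (s1²/n1 + s2²/n2)² / [(s1²/n1)²/(n1 − 1) + (s2²/n2)²/(n2 − 1)], which the text does not spell out. The function names are illustrative, and the inputs are the DRP summary statistics reported in Example 7.15.

```python
def satterthwaite_df(s1, n1, s2, n2):
    """Welch-Satterthwaite approximate degrees of freedom (assumed formula)."""
    v1 = s1**2 / n1  # estimated variance of the first sample mean
    v2 = s2**2 / n2  # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))


def conservative_df(n1, n2):
    """Conservative choice: the smaller of n1 - 1 and n2 - 1."""
    return min(n1, n2) - 1


# DRP summary statistics: treatment (s = 11.01, n = 21), control (s = 17.15, n = 23).
df_software = satterthwaite_df(11.01, 21, 17.15, 23)
df_conservative = conservative_df(21, 23)

print(f"Satterthwaite df = {df_software:.2f}, conservative df = {df_conservative}")
```

The Satterthwaite value is generally not an integer and is never smaller than the conservative value, which is why the conservative intervals are at least as wide.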

The two-sample t confidence interval

We now apply the basic ideas about t procedures to the problem of comparing two means when the standard deviations are unknown. Just as we did in the one-sample case, we start with confidence intervals.

Example 7.14 Directed reading activities assessment.

An educator believes that new directed reading activities in the classroom will help elementary school pupils improve some aspects of their reading ability. She arranges for a third-grade class of 21 students to take part in these activities for an eight-week period. A control classroom of 23 third-graders follows the same curriculum without the activities. At the end of the eight weeks, all students are given a Degree of Reading Power (DRP) test, which measures the aspects of reading ability that the treatment is designed to improve. The data appear in TABLE 7.4.26

Table 7.4 DRP scores for third-graders

Treatment group    | Control group
24  61  59  46     | 42  33  46  37
43  44  52  43     | 43  41  10  42
58  67  62  57     | 55  19  17  55
71  49  54         | 26  54  60  28
43  53  57         | 62  20  53  48
49  56  33         | 37  85  42

Prior to inference, we need to check whether the t procedures can be used for these data. The following back-to-back stemplot suggests that there is a mild outlier in the control group but no deviation from Normality serious enough to forbid use of t procedures. Separate Normal quantile plots for both groups (Figure 7.14) confirm that both distributions are approximately Normal.

A back-to-back stemplot and normal quantiles plots of the data.

Figure 7.14 Normal quantile plots of the DRP scores, Example 7.14.

The design of the study in Example 7.14 is not ideal. Random assignment of students was not possible in a school environment, so existing third-grade classes were used. The effect of the reading programs is, therefore, confounded with any other differences between the two classes. That said, the classes were chosen to be as similar as possible—for example, in terms of the social and economic status of the students. Extensive pretesting also showed that the two classes were, on the average, quite similar in reading ability at the beginning of the experiment. To avoid the effect of two different teachers, the researcher herself taught reading in both classes during the eight-week period of the experiment. Therefore, we can be somewhat confident that the two-sample test is detecting the effect of the treatment and not some other difference between the classes. This example is typical of many situations in which an experiment is carried out but randomization is not possible.

Example 7.15 Computing an approximate 95% confidence interval for the difference in means.

From our examination of the data in Example 7.14, the scores of the treatment group appear to be somewhat higher than those of the control group. The summary statistics are

Group n x¯ s
Treatment 21 51.48 11.01
Control 23 41.52 17.15

To describe the size of the treatment effect, let’s construct a confidence interval for the difference between the treatment group and the control group means. The interval is

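$$(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = (51.48 - 41.52) \pm t^* \sqrt{\frac{11.01^2}{21} + \frac{17.15^2}{23}} = 9.96 \pm (t^* \times 4.31)$$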

Using software, the degrees of freedom are 37.86 and t*=2.025. This approximation gives

9.96±(2.025×4.31)=9.96±8.73=(1.2,18.7)

The conservative approach would use the smaller of

n1 − 1 = 21 − 1 = 20 and n2 − 1 = 23 − 1 = 22

Table D gives t*=2.086.

df=20
t* 1.725 2.086 2.197
C 0.90 0.95 0.96

With this approximation, we have

9.96±(4.31×2.086)=9.96±8.99=(1.0,18.9)

The conservative approach gives a slightly wider interval than the more accurate approximation used by software. However, the difference is very small. We estimate the mean improvement to be about 10 points, with a margin of error of almost 9 points. Unfortunately, the data do not allow a very precise estimate of the size of the average improvement.
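
The two intervals in Example 7.15 can be reproduced from the summary statistics alone. The sketch below is one way to do it; the function name is an illustrative choice, and the critical values t* are not computed but taken from the text (2.025 from software, 2.086 from Table D).

```python
from math import sqrt


def two_sample_t_interval(xbar1, s1, n1, xbar2, s2, n2, t_star):
    """Return (lower, upper) for the difference in means, given a critical value t*."""
    diff = xbar1 - xbar2
    standard_error = sqrt(s1**2 / n1 + s2**2 / n2)
    margin = t_star * standard_error
    return diff - margin, diff + margin


# DRP summary statistics: treatment is Group 1, control is Group 2.
drp = dict(xbar1=51.48, s1=11.01, n1=21, xbar2=41.52, s2=17.15, n2=23)

software_ci = two_sample_t_interval(**drp, t_star=2.025)      # df = 37.86
conservative_ci = two_sample_t_interval(**drp, t_star=2.086)  # df = 20

print("software df:     (%.1f, %.1f)" % software_ci)
print("conservative df: (%.1f, %.1f)" % conservative_ci)
```

Only t* changes between the two calls, so the comparison isolates how much the choice of degrees of freedom widens the interval.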

Check-in
  1. 7.15 Two-sample t confidence interval. Suppose a study similar to Example 7.14 were performed using two second-grade classes. Assume that the summary statistics are x¯1=46.32, x¯2=32.85, s1=11.53, s2=15.33, n1=26, and n2=24. Find a 95% confidence interval for the difference between the treatment (Group 1) and the control (Group 2) means using the second approximation for degrees of freedom. Also write a one-sentence summary of what this confidence interval says about the difference in means.

  2. 7.16 Smaller sample sizes. Refer to the previous Check-in question. Suppose instead that the two classes are smaller but the summary statistics do not change: x¯1=46.32, x¯2=32.85, s1=11.53, s2=15.33, n1=16, and n2=14. Find a 95% confidence interval for the difference using the second approximation for degrees of freedom. Compare this interval with the one in the previous exercise and discuss the impact smaller sample sizes have on a confidence interval.

The two-sample t significance test

The same ideas that we used for the two-sample t confidence interval also apply to the two-sample t significance test. We can use either software or the conservative approach with Table D to approximate the P-value.

Example 7.16 Is there an improvement?

For the DRP study described in Example 7.14 (page 414), we hope to show that the treatment (Group 1) performs better than the control (Group 2). For the two-sample t significance test, the same set of hypotheses can be presented as

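$$H_0\colon \mu_1 = \mu_2, \quad H_a\colon \mu_1 > \mu_2 \qquad \text{or} \qquad H_0\colon \mu_1 - \mu_2 = 0, \quad H_a\colon \mu_1 - \mu_2 > 0$$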

The two-sample t statistic is

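$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{51.48 - 41.52}{\sqrt{\dfrac{11.01^2}{21} + \dfrac{17.15^2}{23}}} = 2.31$$

The P-value for the one-sided test is P(T ≥ 2.31). Software gives the approximate P-value of 0.0132 based on 37.86 degrees of freedom.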

df=20
p 0.02 0.01
t* 2.197 2.528

Without software, we’d again use 20 degrees of freedom. Comparing 2.31 with the row entries in Table D, we see that P lies between 0.01 and 0.02.

The data strongly suggest that directed reading activity improves the DRP score (t=2.31,df=20,0.01<P<0.02).
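
For readers who want to see where such P-values come from, the sketch below computes the t statistic from the summary statistics and then approximates the upper-tail area of a t(k) distribution by numerically integrating its density with Simpson's rule. This is an illustration only, not the algorithm any particular package uses, and the helper names are hypothetical. The density formula is the standard one for the t distribution and works for non-integer k.

```python
from math import exp, lgamma, log, pi, sqrt


def t_density(x, df):
    """Density of the t distribution with df degrees of freedom."""
    log_const = lgamma((df + 1) / 2) - lgamma(df / 2) - 0.5 * log(df * pi)
    return exp(log_const - (df + 1) / 2 * log(1 + x * x / df))


def t_upper_tail(t, df, steps=2000):
    """Approximate P(T >= t) using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_density(0.0, df) + t_density(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_density(i * h, df)
    area = total * h / 3  # area under the density between 0 and |t|
    return 0.5 - area if t >= 0 else 0.5 + area


# DRP summary statistics (treatment is Group 1, control is Group 2).
xbar1, s1, n1 = 51.48, 11.01, 21
xbar2, s2, n2 = 41.52, 17.15, 23

t_stat = (xbar1 - xbar2) / sqrt(s1**2 / n1 + s2**2 / n2)
p_software_df = t_upper_tail(t_stat, 37.86)   # Satterthwaite df from the text
p_conservative_df = t_upper_tail(t_stat, 20)  # smaller of n1 - 1 and n2 - 1

print(f"t = {t_stat:.2f}")
print(f"P (df = 37.86) = {p_software_df:.4f}")
print(f"P (df = 20)    = {p_conservative_df:.4f}")
```

The conservative P-value is the larger of the two, in line with the Table D bounds 0.01 < P < 0.02.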

Check-in
  1. 7.17 A two-sample t significance test. Refer to Check-in question 7.15. Perform a significance test at the 0.05 level to assess whether the average improvement is five points versus the alternative that it is greater than five points. Write a one-sentence conclusion.

  2. 7.18 Interpreting the confidence interval. Refer to the previous Check-in question. Can the confidence interval in Check-in question 7.15 (page 416) be used to determine whether the significance test of the previous Check-in question rejects or does not reject the null hypothesis? Explain your answer.

Most statistical software requires the raw data for analysis. A few, like Minitab, will also perform a t test on data in summarized form (such as the summary statistics table in Example 7.15). It is always preferable to work with the raw data because one can also examine the data through plots such as the back-to-back stemplot and Normal quantile plots in Example 7.14.

Example 7.17 Using software.

Figure 7.15 shows JMP and Minitab outputs for the comparison of DRP scores. Both outputs include the 95% confidence interval and the significance test that the means are equal. JMP reports the difference as the mean of treatment minus the mean of control, while Minitab reports the difference in the opposite order.

J M P and Minitab outputs for the D R P score comparison.

Figure 7.15 JMP and Minitab outputs, Example 7.17.

Recall that the confidence interval (treatment minus control) is

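$$(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}} = (51.48 - 41.52) \pm t^* \sqrt{\frac{11.01^2}{21} + \frac{17.15^2}{23}} = 9.96 \pm (t^* \times 4.31)$$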

From the JMP output, we see that the degrees of freedom under the first (Satterthwaite) approximation are 37.86. Minitab also uses this approximation but rounds the degrees of freedom down to the nearest integer (37.9 → 37). As a result, its margin of error is slightly wider than that of JMP, and the P-value of the significance test is slightly larger.

The default Minitab output only considers the two-sided alternative. Our test in Example 7.16 is one-sided. If your software gives you the P-value for only the two-sided alternative, 2P(T ≥ |t|), you need to divide the reported value by 2 after checking that the means differ in the direction specified by the alternative hypothesis.
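
The conversion just described can be written as a short helper. This is a sketch under stated assumptions: the function name and the "greater"/"less" labels are hypothetical, and the two-sided value 0.0264 used below is simply twice the one-sided P-value of 0.0132 from Example 7.16, not a number read from the Minitab output. When the observed difference points away from the alternative, the one-sided P-value is 1 minus half the two-sided value, so the data give no evidence for the alternative.

```python
def one_sided_p(two_sided_p, t_stat, alternative):
    """Convert a two-sided P-value, 2P(T >= |t|), into a one-sided P-value.

    alternative: "greater" for Ha: mu1 > mu2, "less" for Ha: mu1 < mu2.
    """
    if alternative not in ("greater", "less"):
        raise ValueError("alternative must be 'greater' or 'less'")
    # Does the sign of t agree with the direction stated in Ha?
    agrees = t_stat > 0 if alternative == "greater" else t_stat < 0
    half = two_sided_p / 2
    return half if agrees else 1 - half


# Treatment minus control gives t = 2.31, which agrees with Ha: mu1 > mu2.
p_right_direction = one_sided_p(0.0264, 2.31, "greater")

# If the software reports control minus treatment, t is negative; check the
# direction against Ha before halving.
p_wrong_direction = one_sided_p(0.0264, -2.31, "greater")

print(p_right_direction, p_wrong_direction)
```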

Robustness of the two-sample procedures

The two-sample t procedures are more robust than the one-sample t methods. When the sizes of the two samples are equal and the distributions of the two populations being compared have similar shapes, probability values from the t table are quite accurate for a broad range of distributions when the sample sizes are as small as n1=n2=5.27 When the two population distributions have different shapes, larger samples are needed. The guidelines given on page 398 for the use of one-sample t procedures can be adapted to two-sample procedures by replacing “sample size” with the “sum of the sample sizes” n1+n2. Specifically,

These guidelines are rather conservative, especially when the two samples are of equal size. In planning a two-sample study, choose equal sample sizes if you can. The two-sample t procedures are most robust against non-Normality in this case, and the conservative probability values are most accurate.

Here is an example with sample sizes that are almost equal and whose total sample size is more than 40. Even if the distributions are not Normal, we are confident that the sample means will be approximately Normal. The two-sample t procedures are very robust in this case.

Example 7.18 Low-calorie sweeteners and body weight.

Low-calorie sweeteners (LCSs) are commonly used as sugar substitutes because they provide sweetness with little or no energy. The unique chemical structure of each LCS, however, may evoke different sensory and behavioral responses that could affect body weight. To study this, a 12-week randomized trial was run to compare 4 LCSs and sugar.28 We will just focus on two of the LCSs (saccharin and sucralose) for this example.

Each day, participants were asked to also consume a sweetened beverage in addition to their normal diet. Here are the summary statistics of the weight change over the 12 weeks, in kilograms (kg):

Group        n    x¯      s
Saccharin   25    1.17   2.56
Sucralose   24   −0.84   2.84

Those who consumed sucralose lost slightly less than a kilogram, while those who consumed saccharin gained slightly more than a kilogram. Can we conclude that these two groups are not the same? Or is this difference what we could expect to see, given the variation among participants?

The researchers did not specify a direction for the difference. Thus, the hypotheses are

$$H_0\colon \mu_1 - \mu_2 = 0, \quad H_a\colon \mu_1 - \mu_2 \neq 0$$

Figure 7.16 contains the Normal quantile plots for each group. There are no obvious outliers, and the distributions are reasonably Normal. However, given that the sum of the sample sizes is large (n1 + n2 ≥ 40), we really only need to concern ourselves with outliers. We can confidently use the t procedures.

Two normal quantile plots of weight change.

Figure 7.16 Normal quantile plots, Example 7.18.

The two-sample t statistic is

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}} = \frac{1.17 - (-0.84)}{\sqrt{\dfrac{2.56^2}{25} + \dfrac{2.84^2}{24}}} = 2.60$$

The conservative approach finds the P-value by comparing 2.60 to critical values for the t(23) distribution because the smaller sample has 24 observations. Using Table D, we find 2(0.005) < P < 2(0.01), that is, 0.01 < P < 0.02. Software using the Satterthwaite approximation gives P=0.012 based on 46.1 degrees of freedom.

df=23
p 0.01 0.005
t* 2.500 2.807

The data give strong evidence that saccharin results in a larger weight change than sucralose (t=2.60,df=46.1,P=0.012). The assertion that the LCS chemical structure evokes different sensory and behavioral responses that affect body weight is supported.

In this and other examples, we can choose which population to label 1 and which to label 2. After inspecting the data, we chose saccharin consumers as Population 1 because this choice makes the t statistic a positive number. This avoids any possible confusion from reporting a negative value for t. Choosing the population labels is not the same as choosing a one-sided alternative after looking at the data. Choosing hypotheses after seeing a result in the data is a violation of sound statistical practice.

Inference for small samples

Small samples require special care. We do not have enough observations to examine the distribution shapes, and only extreme outliers stand out. The power of significance tests tends to be low, and the margins of error of confidence intervals tend to be large. Despite these difficulties, we can often draw important conclusions from studies with small sample sizes. If the size of an effect is very large, it should still be evident even if the n’s are small.

Example 7.19 A small study of LCSs and body weight.

In the setting of Example 7.18, let’s consider a much smaller study that collects weight change data from only five participants in each LCS group. Also, given the results of this past example, we choose the one-sided alternative. The data are

Group       Weight change (kg)
Saccharin    0.5   3.0   4.2   2.3   3.3
Sucralose    0.1   1.2  −1.9  −2.3  −0.8

First, examine the distributions with a back-to-back stemplot

A back-to-back stemplot for saccharin and sucralose.

While there is variation among weight changes within each group, there is also a noticeable separation. The saccharin group contains four of the five largest weight gains, and the sucralose group contains four of the five smallest losses. A significance test can confirm whether this pattern can arise just by chance or if the saccharin group has a higher mean. We test

$$H_0\colon \mu_1 = \mu_2, \quad H_a\colon \mu_1 > \mu_2$$

The average weight change is higher in the saccharin group (t=3.81,df=7.99,P=0.0026). The difference in sample means is 3.40 kg.
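
The reported t statistic and degrees of freedom can be recomputed directly from the ten observations. The sketch below uses only the Python standard library. It assumes that the sucralose values are 0.1, 1.2, −1.9, −2.3, and −0.8 kg, the assignment of signs that reproduces the reported difference in sample means of 3.40 kg, and it uses the standard Satterthwaite formula for the degrees of freedom.

```python
from math import sqrt
from statistics import mean, variance

# Weight changes in kg. The signs of the last three sucralose values are an
# assumption, chosen to match the reported difference in means of 3.40 kg.
saccharin = [0.5, 3.0, 4.2, 2.3, 3.3]
sucralose = [0.1, 1.2, -1.9, -2.3, -0.8]

n1, n2 = len(saccharin), len(sucralose)
diff = mean(saccharin) - mean(sucralose)

# Estimated variances of the two sample means.
v1 = variance(saccharin) / n1
v2 = variance(sucralose) / n2

t_stat = diff / sqrt(v1 + v2)

# Satterthwaite approximation to the degrees of freedom.
df = (v1 + v2) ** 2 / (v1**2 / (n1 - 1) + v2**2 / (n2 - 1))

print(f"difference = {diff:.2f} kg, t = {t_stat:.2f}, df = {df:.2f}")
```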

Figure 7.17 gives outputs for this analysis from four software packages. Although the formats differ, the basic information is the same. All report the sample sizes, the sample means and standard deviations (or variances), the t statistic, and its P-value. All agree that the P-value is small, though some outputs give more detail than others. Software often labels the groups in alphabetical order. Always check the means first and report the statistic (you may need to change the sign) in an appropriate way. Be sure to also mention the size of the effect you observed, such as “The mean weight change for the saccharin group was 3.40 kg higher than for the sucralose group.”

Excel, Minitab, J M P, and S P S S outputs for the weight change comparison.

Figure 7.17 Excel, Minitab, JMP, and SPSS outputs, Example 7.19.

The SPSS output reports the results of two t procedures: the general two-sample procedure that we have just studied and a special procedure which assumes that the two population variances are equal. The “equal-variances” procedures are most helpful in cases like this when the sample sizes n1 and n2 are small and it is reasonable to assume equal variances. When appropriate, these methods result in slightly smaller margins of error and slightly greater power. To understand why this is the case, let’s briefly explore these procedures.

The pooled two-sample t procedures

Suppose that the two Normal population distributions whose means we want to compare have the same standard deviation. How does this additional condition impact our t statistic? Let’s investigate! As we did with the general procedure, we’ll first develop the z statistic and from it the t statistic.

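Let’s call the common (but unknown) standard deviation σ. The addition rule for variances says that x¯1 − x¯2 has variance equal to the sum of the individual variances, which in this case is

$$\frac{\sigma^2}{n_1} + \frac{\sigma^2}{n_2} = \sigma^2\left(\frac{1}{n_1} + \frac{1}{n_2}\right)$$

The standardized difference of means is therefore

$$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sigma\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}}$$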

This is the special two-sample z statistic for the case in which the populations have the same σ.

To get to the t statistic, we replace the unknown σ with its estimate. Because both sample variances s1² and s2² estimate σ², it would make sense to combine them into a single estimate. It turns out the best way to do this is to average them with weights equal to their degrees of freedom. This gives more weight to the information from the larger sample. The resulting estimator of σ² is

$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

This is called the pooled estimator of σ² because it combines the information in both samples.

Because we replace a single standard deviation σ by its estimate sp, the resulting t statistic has a t distribution. The degrees of freedom are n1 + n2 − 2, the sum of the degrees of freedom of the two sample variances. These degrees of freedom are always at least as large as the degrees of freedom for the general two-sample t procedures. The larger degrees of freedom and the pooled estimate of variance are the reasons these procedures are helpful. However, to get these gains, we have an additional condition of a common standard deviation.

Example 7.20 Calcium and blood pressure.

Does increasing the amount of calcium in our diet reduce blood pressure? Examination of a large sample of people revealed a relationship between calcium intake and blood pressure, but such observational studies do not establish causation. Animal experiments, however, showed that calcium supplements do reduce blood pressure in rats, justifying an experiment with human subjects. A randomized comparative experiment gave one group of 10 black men a calcium supplement for 12 weeks. The control group of 11 black men received a placebo that appeared identical. (In fact, a block design with black and white men as the blocks was used. We will look only at the results for the black men because an earlier survey suggested that calcium is more effective for blacks.) The experiment was double-blind. TABLE 7.5 gives the seated systolic (heart contracted) blood pressure for all subjects at the beginning and end of the 12-week period, in millimeters of mercury (mm Hg). Because the researchers were interested in decreasing blood pressure, Table 7.5 also shows the decrease for each subject. An increase appears as a negative entry.29

Table 7.5 Seated systolic blood pressure (mm Hg)

Calcium group             | Placebo group
Begin   End   Decrease    | Begin   End   Decrease
107     100       7       | 123     124      −1
110     114      −4       | 109      97      12
123     105      18       | 112     113      −1
129     112      17       | 102     105      −3
112     115      −3       |  98      95       3
111     116      −5       | 114     119      −5
107     106       1       | 119     114       5
112     102      10       | 114     112       2
136     125      11       | 110     121     −11
102     104      −2       | 117     118      −1
                          | 130     133      −3

As usual, we first examine the data. To compare the effects of the two treatments, take the response variable to be the amount of the decrease in blood pressure. Inspection of the data reveals that there are no outliers. Side-by-side boxplots and Normal quantile plots (Figures 7.18 and 7.19) give a more detailed picture. The calcium group has a somewhat short left tail, but there are no severe departures from Normality that will prevent use of t procedures.

Two box plots compare blood pressure decrease for calcium and a placebo.

Figure 7.18 Side-by-side boxplots of the decrease in blood pressure from Table 7.5.

Two normal quantile plots of weight change.

Figure 7.19 Normal quantile plots of the change in blood pressure from Table 7.5.

To examine the question of the researchers who collected these data, we perform a significance test.

Example 7.21 Does increased calcium reduce blood pressure?

Take Group 1 to be the calcium group and Group 2 to be the placebo group. The evidence that calcium lowers blood pressure more than a placebo is assessed by testing

$$H_0\colon \mu_1 = \mu_2, \quad H_a\colon \mu_1 > \mu_2$$

Here are the summary statistics for the decrease in blood pressure:

Group  Treatment   n    x¯       s
1      Calcium    10    5.000   8.743
2      Placebo    11   −0.273   5.901

The calcium group shows a drop in blood pressure, and the placebo group has a small increase. The sample standard deviations do not rule out equal population standard deviations. A difference this large will often arise by chance in samples this small. We are willing to assume equal population standard deviations. The pooled sample variance is

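$$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2} = \frac{(10 - 1)8.743^2 + (11 - 1)5.901^2}{10 + 11 - 2} = 54.536$$

so that

$$s_p = \sqrt{54.536} = 7.385$$

The pooled two-sample t statistic is

$$t = \frac{(\bar{x}_1 - \bar{x}_2) - 0}{s_p\sqrt{\dfrac{1}{n_1} + \dfrac{1}{n_2}}} = \frac{5.000 - (-0.273)}{7.385\sqrt{\dfrac{1}{10} + \dfrac{1}{11}}} = \frac{5.273}{3.227} = 1.634$$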

The P-value is P(T ≥ 1.634), where T has the t(19) distribution. From Table D, we can see that P falls between the α=0.10 and α=0.05 levels. Statistical software gives the exact value P=0.059. The experiment found evidence that calcium reduces blood pressure, but the evidence falls a bit short of the traditional 5% and 1% levels.

df=19
p 0.10 0.05
t* 1.328 1.729

Sample size strongly influences the P-value of a test. An effect that fails to be significant at a specified level α in a small sample can be significant in a larger sample. In the light of the rather small samples in Example 7.20, the evidence for some effect of calcium on blood pressure is rather good. The published account of the study combined these results for blacks with the results for whites and adjusted for pretest differences among the subjects. Using this more detailed analysis, the researchers were able to report a P-value of 0.008.

Of course, a P-value is almost never the last part of a statistical analysis. To make a judgment regarding the size of the effect of calcium on blood pressure, we need a confidence interval.

Example 7.22 How different are the calcium and placebo groups?

We estimate that the effect of calcium supplementation is the difference between the sample means of the calcium and the placebo groups, x¯1 − x¯2 = 5.273 mm Hg. A 90% confidence interval for μ1 − μ2 uses the critical value t*=1.729 from the t(19) distribution. The interval is

$$(\bar{x}_1 - \bar{x}_2) \pm t^* s_p\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = [5.000 - (-0.273)] \pm (1.729)(7.385)\sqrt{\frac{1}{10} + \frac{1}{11}} = 5.273 \pm 5.579$$

We are 90% confident that the difference in means is in the interval (−0.306, 10.852). The calcium treatment reduced blood pressure by about 5.3 mm Hg more than a placebo on the average, but the margin of error of this estimate is 5.6 mm Hg.
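
The pooled calculations in Examples 7.21 and 7.22 can be reproduced from the Begin and End columns of Table 7.5. The following is a sketch using only the Python standard library; the variable names are illustrative, and the critical value 1.729 is taken from Table D rather than computed.

```python
from math import sqrt
from statistics import mean, stdev

# Seated systolic blood pressure (mm Hg) from Table 7.5.
calcium_begin = [107, 110, 123, 129, 112, 111, 107, 112, 136, 102]
calcium_end = [100, 114, 105, 112, 115, 116, 106, 102, 125, 104]
placebo_begin = [123, 109, 112, 102, 98, 114, 119, 114, 110, 117, 130]
placebo_end = [124, 97, 113, 105, 95, 119, 114, 112, 121, 118, 133]

# Response variable: decrease = begin - end (an increase is negative).
calcium = [b - e for b, e in zip(calcium_begin, calcium_end)]
placebo = [b - e for b, e in zip(placebo_begin, placebo_end)]
n1, n2 = len(calcium), len(placebo)

# Pooled standard deviation: sample variances weighted by degrees of freedom.
pooled_var = ((n1 - 1) * stdev(calcium) ** 2 + (n2 - 1) * stdev(placebo) ** 2) / (n1 + n2 - 2)
sp = sqrt(pooled_var)

diff = mean(calcium) - mean(placebo)
standard_error = sp * sqrt(1 / n1 + 1 / n2)
t_stat = diff / standard_error

t_star = 1.729  # Table D critical value for 90% confidence, t(19)
ci = (diff - t_star * standard_error, diff + t_star * standard_error)

print(f"sp = {sp:.3f}, t = {t_stat:.3f}, df = {n1 + n2 - 2}")
print(f"90% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Computing the decreases from the raw columns, rather than typing them in, guards against sign errors in the Decrease column.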

The pooled two-sample t procedures are anchored in statistical theory and so have long been the standard version of the two-sample t in textbooks. These procedures, however, require the condition that the two unknown population standard deviations are equal. This condition is hard to verify.

The pooled t procedures are, therefore, a bit risky. They are reasonably robust against both non-Normality and unequal standard deviations when the sample sizes are nearly the same. When the samples are quite different in size, the pooled t procedures become sensitive to unequal standard deviations and should be used with caution unless the samples are large. Unequal standard deviations are very common. In particular, it is not unusual for the spread of data to increase when the center of the data increases. We recommend regular use of the general t procedures, particularly when software automates the Satterthwaite approximation.

Check-in
  1. 7.19 Using software. Figure 7.17 (page 422) gives the outputs from four software packages for comparing the weight change of two groups consuming different low-calorie sweeteners. For the general two-sample t test, all software use the Satterthwaite approximation for degrees of freedom, but some round down or to the nearest integer. Summarize what each software does and provide its P-value.

  2. 7.20 Let’s consider the pooled t test. Example 7.18 (page 419) gives summary statistics for the weight change in two low-calorie sweetener groups. The two sample standard deviations are relatively close, so we may be willing to assume equal population standard deviations. Calculate the pooled t statistic and its degrees of freedom from the summary statistics. Use Table D to assess significance. How do your results compare with the unpooled analysis for these data?

Section 7.2 SUMMARY

  • Significance tests and confidence intervals for the difference between the means μ1 and μ2 of two Normal populations are based on the difference x¯1 − x¯2 between the sample means from two independent SRSs. Because of the central limit theorem, the resulting procedures are approximately correct for other population distributions when the sample sizes are large.

  • When independent SRSs of sizes n1 and n2 are drawn from two Normal populations with parameters μ1, σ1 and μ2, σ2 the two-sample z statistic

    $$z = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{\sigma_1^2}{n_1} + \dfrac{\sigma_2^2}{n_2}}}$$

    has the N(0, 1) distribution.

  • The two-sample t statistic

    $$t = \frac{(\bar{x}_1 - \bar{x}_2) - (\mu_1 - \mu_2)}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

    does not have a t distribution. However, software can give accurate P-values using the Satterthwaite approximation.

  • Conservative inference procedures for comparing μ1 and μ2 are obtained from the two-sample t statistic by using the t(k) distribution with degrees of freedom k equal to the smaller of n1 − 1 and n2 − 1. Use this method only when you are not using software.

  • An approximate level C confidence interval for μ1 − μ2 is given by

    $$(\bar{x}_1 - \bar{x}_2) \pm t^* \sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}$$

    Here, t* is the value for the t(k) density curve with area C between −t* and t*, where k is computed from the data by software or is the smaller of n1 − 1 and n2 − 1.

  • Significance tests for H0: μ1 − μ2 = Δ0 use the two-sample t statistic

    $$t = \frac{(\bar{x}_1 - \bar{x}_2) - \Delta_0}{\sqrt{\dfrac{s_1^2}{n_1} + \dfrac{s_2^2}{n_2}}}$$

    The P-value is approximated using the t(k) distribution, where k is determined from the data by software or is the smaller of n1 − 1 and n2 − 1.

  • The guidelines for practical use of two-sample t procedures are similar to those for one-sample t procedures. Equal sample sizes are recommended.

  • If we can consider that the two populations have equal variances, the pooled two-sample t procedures can be used. These are based on the pooled estimator of the common variance σ²

    $$s_p^2 = \frac{(n_1 - 1)s_1^2 + (n_2 - 1)s_2^2}{n_1 + n_2 - 2}$$

    and the t(n1 + n2 − 2) distribution. We do not recommend this procedure for regular use.
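
The formulas in this summary can be sketched in a few lines of Python. This is a minimal illustration, not the output of any particular statistical package: the function names and the summary statistics in the usage lines are hypothetical, and the critical value t* must be supplied by the reader (for example, from Table D) because the sketch uses only the standard library. The pooled statistic divides by sp·sqrt(1/n1 + 1/n2), the usual form that goes with the pooled estimator above.

```python
import math


def two_sample_t(mean1, s1, n1, mean2, s2, n2, delta0=0.0):
    """Unpooled two-sample t statistic for H0: mu1 - mu2 = delta0."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    return ((mean1 - mean2) - delta0) / se


def two_sample_interval(mean1, s1, n1, mean2, s2, n2, t_star):
    """Approximate confidence interval for mu1 - mu2; t_star is looked up by the caller."""
    se = math.sqrt(s1 ** 2 / n1 + s2 ** 2 / n2)
    diff = mean1 - mean2
    return diff - t_star * se, diff + t_star * se


def pooled_t(mean1, s1, n1, mean2, s2, n2, delta0=0.0):
    """Pooled t statistic and its degrees of freedom, n1 + n2 - 2."""
    sp2 = ((n1 - 1) * s1 ** 2 + (n2 - 1) * s2 ** 2) / (n1 + n2 - 2)
    se = math.sqrt(sp2 * (1 / n1 + 1 / n2))
    return ((mean1 - mean2) - delta0) / se, n1 + n2 - 2


# Hypothetical summary statistics, chosen only to exercise the functions.
t_unpooled = two_sample_t(25.0, 3.0, 10, 20.0, 4.0, 10)
t_pooled, df_pooled = pooled_t(25.0, 3.0, 10, 20.0, 4.0, 10)

# Conservative degrees of freedom: smaller of n1 - 1 and n2 - 1, here 9.
# 2.262 is the 95% critical value for t(9).
low, high = two_sample_interval(25.0, 3.0, 10, 20.0, 4.0, 10, t_star=2.262)

print("unpooled t:", t_unpooled)
print("pooled t:", t_pooled, "with df:", df_pooled)
print("95% interval:", (low, high))
```

With equal sample sizes, as in these hypothetical numbers, the pooled and unpooled t statistics coincide; only the degrees of freedom differ.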

Section 7.2 EXERCISES

In these two-sample t problems, try to use the degrees of freedom approximation provided by software. For exercises involving summarized data, this approximation is provided for you. If you instead use the conservative approximation, the smaller of n1 − 1 and n2 − 1, be sure to clearly state this.
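
For readers curious about where the software value of k comes from, here is a minimal sketch of the Satterthwaite degrees of freedom computed from summary statistics. The formula is the standard one for this approximation; the function names are hypothetical. The check uses only the standard deviations and sample sizes given in Exercises 7.38 (sprint speed) and 7.42 below, for which the text reports k = 16.2 and k = 54.2.

```python
def satterthwaite_df(s1, n1, s2, n2):
    """Satterthwaite approximation to the degrees of freedom of the two-sample t."""
    v1 = s1 ** 2 / n1  # estimated variance of the first sample mean
    v2 = s2 ** 2 / n2  # estimated variance of the second sample mean
    return (v1 + v2) ** 2 / (v1 ** 2 / (n1 - 1) + v2 ** 2 / (n2 - 1))


def conservative_df(n1, n2):
    """Conservative choice: the smaller of n1 - 1 and n2 - 1."""
    return min(n1 - 1, n2 - 1)


# Exercise 7.38, sprint speed: s = 0.7 with n = 16, and s = 1.5 with n = 13.
k_sprint = satterthwaite_df(0.7, 16, 1.5, 13)
# Exercise 7.42: s = 55.8 and s = 73.1, with n = 30 in each group.
k_weeks = satterthwaite_df(55.8, 30, 73.1, 30)

print("Exercise 7.38 sprint speed, k rounded:", round(k_sprint, 1))
print("Exercise 7.42, k rounded:", round(k_weeks, 1))
print("Conservative df for n1 = 16, n2 = 13:", conservative_df(16, 13))
```

The Satterthwaite value is never smaller than the conservative choice and never larger than n1 + n2 − 2, which is why the conservative method gives wider intervals and larger P-values.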

  1. 7.35 What’s wrong? For each of the following statements, explain what is wrong and why.

    1. A researcher wants to test H0: x̄1 = x̄2 versus the two-sided alternative Ha: x̄1 ≠ x̄2.

    2. A study recorded the IQ scores of 100 college freshmen. The scores of the 56 males in the study were compared with the scores of all 100 freshmen using the two-sample methods of this section.

    3. A two-sample t statistic gave a P-value of 0.94. From this, we can reject the null hypothesis with 90% confidence.

    4. A researcher is interested in testing the one-sided alternative Ha:μ1<μ2. The significance test gave t=2.15. Because the P-value for the two-sided alternative is 0.036, he concluded that his P-value was 0.018.

  2. 7.36 Basic concepts. For each of the following, answer the question and give a short explanation of your reasoning.

    1. A 95% confidence interval for the difference between two means is reported as (0.8, 2.3). What can you conclude about the results of a significance test of the null hypothesis that the population means are equal versus the two-sided alternative?

    2. Will larger samples generally give a larger or smaller margin of error for the difference between two sample means?

  3. 7.37 More basic concepts. For each of the following, answer the question and give a short explanation of your reasoning.

    1. A significance test for comparing two means gave t=1.97 with 10 degrees of freedom. Can you reject the null hypothesis that the μ’s are equal versus the two-sided alternative at the 5% significance level?

    2. Answer part (a) for the one-sided alternative that the difference between means is negative.

  4. 7.38 Physical demands of women’s rugby sevens matches. Rugby sevens is rapidly growing in popularity and became an Olympic sport in 2016. Matches are played on a full rugby field and consist of two seven-minute halves. Each team also consists of seven players. To better understand the demands of women’s rugby sevens, a group of researchers compared the physical qualities of elite players from the Canadian National team with a university squad. The following table summarizes some of these qualities:³⁰

    Quality                          Elite (n = 16)      University (n = 13)
                                     x̄        s          x̄        s
    Sprint speed (km/hr)             27.3     0.7        26.0     1.5
    Peak heart rate (bpm)            192.0    6.0        193.0    6.0
    Intermittent recovery test (m)   1160     191        781      129

    Carry out the significance tests using α = 0.05. Report the test statistics and P-values and write a short summary of your conclusion. (Software gives k = 16.2, 25.8, and 26.2, respectively.)

  5. 7.39 Noise levels in fitness classes. Fitness classes often have very loud music that could affect hearing. One study collected noise levels (decibels) in both high-intensity and low-intensity fitness classes across eight commercial gyms in Sydney, Australia.³¹ Data set: noise.

    1. Create a histogram or Normal quantile plot for the high-intensity classes. Do the same for the low-intensity classes. Are the distributions reasonably Normal? Summarize the distributions in words.

    2. Test the equality of means using a two-sided alternative hypothesis and significance level α=0.05.

    3. Are the t procedures appropriate, given your observations in part (a)? Explain your answer.

    4. Remove the one low decibel reading for the low-intensity group and redo the significance test. How does this outlier affect the results?

    5. Do you think the results of the significance test from part (b) or (d) should be reported? Explain your answer.

  6. 7.40 Noise levels in fitness classes, continued. Refer to the previous exercise. In most countries, the workplace noise standard is 85 dB (over eight hours). For every 3 dB increase above that, the amount of exposure time is halved. This means that the exposure time for a dB level of 91 is two hours and for a dB level of 94 it is one hour. Data set: noise.

    1. Construct a 95% confidence interval for the mean dB level in high-intensity classes.

    2. Using the interval in part (a), construct a 95% confidence interval for the number of one-hour classes per day an instructor can teach before possibly risking hearing loss. (Hint: This is a linear transformation.)

    3. Repeat parts (a) and (b) for low-intensity classes.

    4. Explain how one might use these intervals to determine the staff size of a new gym.

  7. 7.41 When is 30/31 days not equal to a month? Time can be expressed on different levels of scale, including days, weeks, months, and years. Can the scale provided influence perception of time? For example, if you placed an order over the phone, would it make a difference if you were told the package would arrive in four weeks or in one month? To investigate this, two researchers asked a group of 267 college students to imagine that their car needed major repairs and would have to stay at the shop. Depending on the group they were randomized to, the student was either told it would take 1 month or 30/31 days. Each student was then asked to give best- and worst-case estimates of when the car would be ready. The interval between these two estimates (in days) was the response. Here are the results:³²

    Group        n     x̄      s
    30/31 days   177   20.4   14.3
    1 month      90    24.8   13.9

    1. Given that the interval cannot be less than 0, the distributions are likely skewed. Comment on the appropriateness of using the t procedures.

    2. Test that the average interval is the same for the two groups using the α = 0.05 significance level. Report the test statistic and P-value, and give a short summary of your conclusion. (Software gives k = 183.7.)

  8. 7.42 When is 52 weeks not equal to a year? Refer to the previous exercise. The researchers also had 60 marketing students read an announcement about a construction project. The expected duration was either 1 year or 52 weeks. Each student was then asked to state the earliest and latest completion dates.

    Group      n    x̄       s
    52 weeks   30   84.1    55.8
    1 year     30   139.6   73.1

    Test that the average interval is the same for the two groups, using the α=0.05 significance level. Report the test statistic and P-value and give a short summary of your conclusion. (Software gives k=54.2.)

  9. 7.43 Trustworthiness and eye color. Why do we naturally tend to trust some strangers more than others? One group of researchers decided to study the relationship between eye color and trustworthiness.³³ In their experiment, the researchers took photographs of 80 students (20 males with brown eyes, 20 males with blue eyes, 20 females with brown eyes, and 20 females with blue eyes), each seated in front of a white background looking directly at the camera with a neutral expression. These photos were cropped so the eyes were horizontal and at the same height in the photo and so the neckline was visible. The researchers then recruited 105 participants to judge the trustworthiness of each student photo. This was done using a 10-point scale, where 1 meant very untrustworthy and 10 very trustworthy. The 80 scores from each participant were then converted to z-scores, and the average z-score of each photo (across all 105 participants) was used for the analysis. Here is a summary of the results:

    Eye color   n    x̄      s
    Brown       40   0.55   1.68
    Blue        40   0.38   1.53

    Can we conclude from these data that brown-eyed students appear more trustworthy compared to their blue-eyed counterparts? Test the hypothesis that the average scores for the two groups are the same. (Software gives k=73.3.)

  10. 7.44 Facebook use in college. Because of Facebook’s popularity among college students, there is a great deal of interest in the relationship between Facebook use and academic performance. One study collected information on n = 1839 undergraduate students to look at the relationships among frequency of Facebook use, participation in Facebook activities, time spent preparing for class, and overall GPA.³⁴

    Students reported preparing for class an average of 706 minutes per week, with a standard deviation of 526 minutes. Students also reported spending an average of 106 minutes per day on Facebook, with a standard deviation of 93 minutes; 8% of the students reported spending no time on Facebook.

    1. Construct a 95% confidence interval for the average number of minutes per week a student prepares for class.

    2. Construct a 95% confidence interval for the average number of minutes per week a student spends on Facebook. (Hint: Be sure to convert from minutes per day to minutes per week.)

    3. Explain why you might expect the population distributions of these two variables to be highly skewed to the right. Do you think this fact makes your confidence intervals invalid? Explain your answer.

  11. 7.45 Possible biases? Refer to the previous exercise. The researcher surveyed students at a four-year public university in the northeastern United States (N=3866). Each student was emailed a link to the survey hosted on SurveyMonkey.com. The researcher stated:

    For the students who did not participate immediately, two additional reminders were sent, one week apart. Participants were offered a chance to enter a drawing to win one of 90 $10 Amazon.com gift cards as incentive. A total of 1839 surveys were completed for an overall response rate of 48%.

    Discuss how these factors influence your interpretation of the results of this survey.

  12. 7.46 Comparing means. Refer to Exercise 7.44. Suppose that you wanted to compare the average minutes per week spent on Facebook with the average minutes per week spent preparing for class.

    1. Provide an estimate of this difference.

    2. Explain why it is incorrect to use the two-sample t test to see if the means differ.

  13. 7.47 Sadness and spending. The “misery is not miserly” phenomenon refers to a person’s spending judgment going haywire when the person is sad. In a study, 31 young adults were given $10 and randomly assigned to either a sad group or a neutral group. The participants in the sad group watched a video about the death of a boy’s mentor (from The Champ), and those in the neutral group watched a video on the Great Barrier Reef. After the video, each participant was offered the chance to trade $0.50 increments of the $10 for an insulated water bottle.³⁵ Here are the data (data set: sadness):

    Group     Purchase price ($)
    Neutral   0.00 2.00 0.00 1.00 0.50 0.00 0.50
              2.00 1.00 0.00 0.00 0.00 0.00 1.00
    Sad       3.00 4.00 0.50 1.00 2.50 2.00 1.50 0.00 1.00
              1.50 1.50 2.50 4.00 3.00 3.50 1.00 3.50

    1. Examine each group’s prices graphically. Is use of the t procedures appropriate for these data? Carefully explain your answer.

    2. Make a table with the sample size, mean, and standard deviation for each of the two groups.

    3. State appropriate null and alternative hypotheses for comparing these two groups.

    4. Perform the significance test at the α=0.05 level, making sure to report the test statistic, degrees of freedom, and P-value. What is your conclusion?

    5. Construct a 95% confidence interval for the mean difference in purchase price between the two groups.

  14. 7.48 Diet and mood. Researchers were interested in comparing the long-term psychological effects of being on a high-carbohydrate, low-fat (LF) diet versus a high-fat, low-carbohydrate (LC) diet.³⁶ A total of 106 overweight and obese participants were randomly assigned to one of these two energy-restricted diets. At 52 weeks, 32 LC dieters and 33 LF dieters remained. Mood was assessed using a total mood disturbance score (TMDS), where a lower score is associated with a less negative mood. A summary of these results follows:

    Group   n    x̄      s
    LC      32   47.3   28.3
    LF      33   19.3   25.8

    1. Is there a difference in the TMDS at Week 52? Test the null hypothesis that the dieters’ average mood in the two groups is the same. Use a significance level of 0.05. (Software gives k=62.1.)

    2. Critics of this study focus on the specific LC diet (that is, the science) and the dropout rate. Explain why the dropout rate is important to consider when drawing conclusions from this study.

  15. 7.49 Drive-thru customer service. QSRMagazine.com assessed 1503 drive-thru visits at quick-service restaurants.³⁷ One benchmark assessed was customer service. Responses ranged from “Rude (1)” to “Very Friendly (5).” The following table breaks down the responses according to two of the chains studied. Data set: drvthru.

    Chain         Rating
                  1    2    3    4    5
    Taco Bell     3    13   25   53   71
    Chick-fil-A   1    0    12   51   119

    1. A researcher decides to compare the average ratings of Taco Bell and Chick-fil-A. Comment on the appropriateness of using the numerical average ratings for these data.

    2. Assuming that an average of these ratings makes sense, comment on the use of the t procedures for these data.

    3. Report the means and standard deviations of the ratings for each chain separately.

    4. Test whether the two chains, on average, have the same customer satisfaction. Use a two-sided alternative hypothesis and a significance level of 5%.

  16. 7.50 Comparison of two web page designs. You want to compare the daily number of hits for two different website designs for your indie rock band. You assign the next 30 days to either Design A or Design B, 15 days to each.

    1. Would you use a one-sided or a two-sided significance test for this problem? Explain your choice.

    2. If you use Table D to find the critical value, what are the degrees of freedom using the second approximation?

    3. If you perform the significance test using α=0.05, how large (positive or negative) must the t statistic be to reject the null hypothesis that the two designs result in the same average hits?

  17. 7.51 Change in portion size. A study of food portion sizes reported that over a 17-year period, the average size of a soft drink consumed by Americans aged two years and older increased from 13.1 ounces (oz) to 19.9 oz. The authors state that the difference is statistically significant with P < 0.01.³⁸ Explain what additional information you would need to compute a confidence interval for the increase and outline the procedure you would use for the computations. Do you think that a confidence interval would provide useful additional information? Explain why or why not.

  18. 7.52 Beverage consumption. The results in the previous exercise were based on two national surveys with a very large number of individuals. Another study also looked at beverage consumption, but the sample sizes were much smaller. One part of this study compared 20 children who were 7 to 10 years old with 5 children who were 11 to 13.³⁹ The younger children consumed an average of 8.2 oz of sweetened drinks per day, and the older ones averaged 14.5 oz. The standard deviations were 10.7 oz and 8.2 oz, respectively.

    1. Do you think that it is reasonable to assume that these data are Normally distributed? Explain why or why not. (Hint: Think about the 68–95–99.7 rule.)

    2. Using the methods in this section, test the null hypothesis that the two groups of children consume equal amounts of sweetened drinks versus the two-sided alternative. Report all details of the significance-testing procedure, along with your conclusion.

    3. Give a 95% confidence interval for the difference in means.

    4. Do you think that the analyses performed in parts (b) and (c) are appropriate for these data? Explain why or why not.

    5. The children in this study were all participants in an intervention study at the Cornell Summer Day Camp at Cornell University. To what extent do you think that these results apply to other groups of children?

  19. 7.53 Study design is important! Recall Exercise 7.50. You are concerned that day of the week may affect the number of hits. So to compare the two web page designs, you choose two successive weeks in the middle of a month. You flip a coin to assign one Monday to the first design and the other Monday to the second. You repeat this for each of the seven days of the week. You now have seven hit amounts for each design. It is incorrect to use the two-sample t test to see if the mean hits differ for the two designs. Carefully explain why.

  20. 7.54 New hybrid tablet and laptop? The purchasing department has suggested that your company switch to a new hybrid tablet and laptop. As CEO, you want data indicating that employees will like these new hybrids over the old laptops. You designate the next 16 employees needing new laptops to participate in an experiment in which 8 will be randomly assigned to receive the standard laptop, and the remainder will receive the new hybrid tablet and laptop. After a month of use, these employees will express their satisfaction with their new computers by responding to the statement “I like my new computer” on a scale from 1 to 5, where 1 represents “strongly disagree,” 2 is “disagree,” 3 is “neutral,” 4 is “agree,” and 5 is “strongly agree.”

    1. The employees with the hybrid computers have an average satisfaction score of 4.2, with standard deviation 0.9. The employees with the standard laptops have an average of 3.8, with standard deviation 1.3. Give a 95% confidence interval for the difference in the mean satisfaction scores for all employees.

    2. Would you reject the null hypothesis that the mean satisfaction for the two types of computers is the same versus the two-sided alternative at significance level 0.05? Use your confidence interval to answer this question. Explain why you do not need to calculate the test statistic.

  21. 7.55 Why randomize? Refer to the previous exercise. A coworker suggested that you give the new hybrid computers to the next eight employees who need new computers and the standard laptop to the following eight. Explain why your randomized design is better.

  22. 7.56 Does ad placement matter? Corporate advertising tries to enhance the image of the corporation. A study compared two ads from two sources, the Wall Street Journal and the National Enquirer. Subjects were asked to pretend that their company was considering a major investment in Performax, the fictitious sportswear firm in the ads. Each subject was asked to respond to the question “How trustworthy was the source in the sportswear company ad for Performax?” on a seven-point scale. Higher values indicated more trustworthiness.⁴⁰ Here is a summary of the results:

    Ad source             n    x̄      s
    Wall Street Journal   66   4.77   1.50
    National Enquirer     61   2.43   1.64

    1. Compare the two sources of ads using a t test. Be sure to state your null and alternative hypotheses, the test statistic, the P-value, and your conclusion. (Software gives k=121.6.)

    2. Give a 95% confidence interval for the difference.

    3. Write a short paragraph summarizing the results of your analyses.

  23. 7.57 Size of trees in the northern and southern halves. The study of 584 longleaf pine trees in the Wade Tract in Thomas County, Georgia, had several purposes. Are trees in one part of the tract more or less like trees in any other part of the tract, or are there differences? In Example 6.1 (page 329), we examined how the trees are distributed in the tract and found that the pattern is not random. In this exercise, we will examine the sizes of the trees. In Exercise 7.19 (page 407), we analyzed the sizes, measured as diameter at breast height (DBH), for a random sample of 40 trees. Here, we divide the tract into northern and southern halves and take random samples of 30 trees from each half. Here are the diameters in centimeters (cm) of the sampled trees (data set: nspines):

    North   27.8 14.5 39.1 3.2 58.8 55.5 25.0 5.4 19.0 30.6
            15.1 3.6 28.4 15.0 2.2 14.2 44.2 25.7 11.2 46.8
            36.9 54.1 10.2 2.5 13.8 43.5 13.8 39.7 6.4 4.8
    South   44.4 26.1 50.4 23.3 39.5 51.0 48.1 47.2 40.3 37.4
            36.8 21.7 35.7 32.0 40.4 12.8 5.6 44.3 52.9 38.0
            2.6 44.6 45.5 29.1 18.7 7.0 43.8 28.3 36.9 51.6

    1. Use a back-to-back stemplot and side-by-side boxplots to examine the data graphically. Describe the patterns in the data.

    2. Is it appropriate to use the two-sample t procedures to compare the mean DBH of the trees in the north half of the tract with the mean DBH of the trees in the south half? Give reasons for your answer.

    3. What are appropriate null and alternative hypotheses for comparing the two samples of tree DBHs? Give reasons for your choices.

    4. Perform the significance test. Report the test statistic, the degrees of freedom, and the P-value. Summarize your conclusion.

    5. Find a 95% confidence interval for the difference in mean DBHs in terms of an estimate and its margin of error. Explain how this interval provides additional information about this problem.

  24. 7.58 Size of trees in the eastern and western halves. Refer to the previous exercise. The Wade Tract can also be divided into eastern and western halves. Here are the DBHs of 30 randomly selected longleaf pine trees from each half (data set: ewpines):

    East   23.5 43.5 6.6 11.5 17.2 38.7 2.3 31.5 10.5 23.7
           13.8 5.2 31.5 22.1 6.7 2.6 6.3 51.1 5.4 9.0
           43.0 8.7 22.8 2.9 22.3 43.8 48.1 46.5 39.8 10.9
    West   17.2 44.6 44.1 35.5 51.0 21.6 44.1 11.2 36.0 42.1
           3.2 25.5 36.5 39.0 25.9 20.8 3.2 57.7 43.3 58.0
           21.7 35.6 30.9 40.6 30.7 35.6 18.2 2.9 20.4 11.4

    Using the questions in the previous exercise, analyze these data.

  25. 7.59 Sales of a small appliance across months. A market research firm supplies manufacturers with estimates of the retail sales of their products from samples of retail stores. Marketing managers are prone to look at the estimate and ignore sampling error. Suppose that an SRS of 55 stores this month shows mean sales of 53 units of a small appliance, with standard deviation 12 units. During the same month last year, an SRS of 63 stores gave mean sales of 50 units, with standard deviation 11 units. An increase from 50 to 53 is a rise of 6%. The marketing manager is happy because sales are up 6%.

    1. Use the two-sample t procedure to give a 95% confidence interval for the difference in mean number of units sold at all retail stores.

    2. Explain in language that the manager can understand why he cannot be certain that sales rose by 6% and that, in fact, sales may even have dropped.

  26. 7.60 An improper significance test. A friend has performed a significance test of the null hypothesis that two means are equal. His report states that the null hypothesis is rejected in favor of the alternative that the first mean is larger than the second. In a presentation on his work, he notes that the first sample mean was larger than the second mean, and this is why he chose this particular one-sided alternative.

    1. Explain what is wrong with your friend’s procedure and why.

    2. Suppose that your friend reported t=1.71 with a P-value of 0.046. What is the correct P-value that he should report?

  27. 7.61 Breast-feeding versus baby formula. A study of iron deficiency among infants compared samples of infants following different feeding regimens. One group contained breast-fed infants, and the infants in another group were fed a standard baby formula without any iron supplements. Here are summary results on blood hemoglobin levels at 12 months of age:⁴¹

    Group        n    x̄      s
    Breast-fed   23   13.3   1.7
    Formula      19   12.4   1.8

    1. Is there significant evidence that the mean hemoglobin level is higher among breast-fed babies? State H0 and Ha and carry out a t test. Give the P-value and state your conclusion (software gives k=37.6).

    2. Give a 95% confidence interval for the mean difference in hemoglobin level between the two populations of infants.

    3. State the assumptions that your procedures in parts (a) and (b) require in order to be valid.

  28. 7.62 Revisiting the sadness and spending study. In Exercise 7.47 (page 430), the purchase price of a water bottle was analyzed using the two-sample t procedure that does not assume equal standard deviations. Compare the means using a significance test and find the 95% confidence interval for the difference using the pooled method. How do the results compare with those you obtained in Exercise 7.47? Data set: sadness.

  29. 7.63 Revisiting the diet and mood study. In Exercise 7.48 (page 430), the total mood disturbance score means were compared using the two-sample t procedures that do not assume equal standard deviations. Compare the means using a significance test and find the 95% confidence interval for the difference using the pooled methods. How do the results compare with those you obtained in Exercise 7.48?

  30. NAEP 7.64 Two-sample test of equivalence. In Section 7.1 (page 396), we were introduced to the one-sample test of equivalence. Using the same ideas, describe how to perform a two-sample test of equivalence.

  31. NAEP 7.65 Revisiting the small-sample example. Refer to Example 7.19 (page 421). This is a case where the sample sizes are quite small. With only five observations per group, we have very little information to make a judgment about whether the population standard deviations are equal. The potential gain from pooling is large when the sample sizes are small. Assume that we will perform a two-sided test using the 5% significance level. Data set: lcs1.

    1. Find the critical value for the unpooled t test statistic that does not assume equal variances. Use the minimum of n1 − 1 and n2 − 1 for the degrees of freedom.

    2. Find the critical value for the pooled t test statistic.

    3. How does comparing these critical values show an advantage of the pooled test?