The ANOVA F test gives a general answer to a general question: Are the differences among observed group means statistically significant? Unfortunately, a small P-value simply tells us that the group means are not all the same. It does not tell us specifically which means differ from each other. Plotting and inspecting the means give us some indication of where the differences lie, but we would like to supplement inspection with formal inference. This section presents two approaches to the task of comparing group means. It concludes with a discussion of the power of the one-way ANOVA F test.
In the ideal situation, specific questions regarding comparisons among the means are posed before the data are collected. We can answer specific questions of this kind and attach a level of confidence to the answers we give. We now explore these ideas through a Facebook study.
An online study was designed to compare the amount of time a Facebook user devotes to reading positive, negative, and neutral Facebook profiles. Each participant was randomly assigned to one of five Facebook profile groups: positive female, positive male, negative female, negative male, or neutral.
Each participant was provided an email link to a survey on Survey Monkey. As part of the survey, the participant was directed to view the assigned Facebook profile page and then answer some questions. The amount of time (in minutes) the participant spent viewing the profile prior to answering the questions was recorded as the response.10
We should always begin our analysis with a check of the data. Time-to-event data (here, the time until the participant begins to answer the survey questions) are often skewed to the right. Preliminary analysis of the residuals (Figure 12.11) confirms this.
Figure 12.11 Normal quantile plot of residuals from one-way ANOVA fit to time-to-event data, Example 12.18.
As a result, we consider the square root of time for analysis.
These results are summarized in Figures 12.12 and 12.13. The residuals appear approximately Normal (Figure 12.12), and our rule for examining standard deviations indicates that we can assume equal population standard deviations.
Figure 12.12 Normal quantile plot of residuals from one-way ANOVA fit to transformed time-to-event data, Example 12.18.
Figure 12.13 Minitab output giving the one-way ANOVA table for the Facebook profile study after the square root transformation, Example 12.18.
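For readers who want to reproduce this kind of analysis in code, here is a minimal Python sketch using scipy. The data values and variable names are hypothetical placeholders; only the square root transformation and the form of the one-way ANOVA F test follow the example.

```python
import numpy as np
from scipy import stats

# Hypothetical viewing times in minutes, keyed by profile group.
# (The actual data set for Example 12.18 is not reproduced here.)
times = {
    "positive_female": np.array([1.2, 3.5, 0.8, 2.6]),
    "positive_male":   np.array([2.1, 0.9, 4.2, 1.4]),
    "negative_female": np.array([3.3, 1.7, 2.8, 5.0]),
    "negative_male":   np.array([4.0, 2.5, 1.1, 3.8]),
    "neutral":         np.array([0.7, 1.9, 2.2, 1.1]),
}

# Square root transformation to reduce the right skew
root_times = {g: np.sqrt(t) for g, t in times.items()}

# One-way ANOVA F test on the transformed response
F, p = stats.f_oneway(*root_times.values())
print(f"F = {F:.2f}, P-value = {p:.4f}")
```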
The F test is significant with a P-value of 0.002. Because the P-value is very small, there is strong evidence against the null hypothesis that all five population means are equal, $H_0\colon \mu_1 = \mu_2 = \mu_3 = \mu_4 = \mu_5$.
Rejecting the null hypothesis and concluding that the five population means are not all the same does not tell us all we'd like to know. We would really like our analysis to provide us with more specific information. For example, the alternative hypothesis is true if $\mu_1 \ne \mu_2$ while the other means all equal $\mu_2$, or if $\mu_5$ differs from the other four means, which are equal to one another, or if all five means are different.
When you reject the ANOVA null hypothesis, additional analyses are
required to clarify the nature of the differences between the
means.
For this study, the researcher predicted that participants would spend more time viewing the negative Facebook pages compared to the positive or neutral pages because the negative pages would stand out more and thus garner more attention. (This is called cognitive salience.) How do we translate these predictions into testable hypotheses?
The researcher hypothesizes that participants exposed to a negative Facebook profile would spend more time viewing the page than would participants exposed to a positive Facebook profile. Because two groups are exposed to negative profiles and two are exposed to positive profiles, and using $\mu_1, \ldots, \mu_5$ to denote the population mean (square root) viewing times for the five profiles in the order listed earlier, we can consider the following null hypothesis:

$H_{01}\colon \tfrac{1}{2}(\mu_3 + \mu_4) = \tfrac{1}{2}(\mu_1 + \mu_2)$

versus the two-sided alternative

$H_{a1}\colon \tfrac{1}{2}(\mu_3 + \mu_4) \ne \tfrac{1}{2}(\mu_1 + \mu_2)$

We could argue that the one-sided alternative

$H_{a1}\colon \tfrac{1}{2}(\mu_3 + \mu_4) > \tfrac{1}{2}(\mu_1 + \mu_2)$

is appropriate for this problem, provided that other evidence suggests this direction and it is not just what the researcher wants to see.
In the preceding example, we used the average of the two negative-profile means, $\tfrac{1}{2}(\mu_3 + \mu_4)$. We can also compare this average with the mean for the neutral profile. This comparison tests whether there is a difference in time between the groups exposed to a negative page and the group exposed to the neutral page. Here are the null and alternative hypotheses:

$H_{02}\colon \tfrac{1}{2}(\mu_3 + \mu_4) = \mu_5$

$H_{a2}\colon \tfrac{1}{2}(\mu_3 + \mu_4) \ne \mu_5$
Each of $H_{01}$ and $H_{02}$ says that a combination of population means is equal to 0. Such combinations of means are called contrasts, which we denote by $\psi$. For the first comparison, the contrast is

$\psi_1 = \tfrac{1}{2}(\mu_3 + \mu_4) - \tfrac{1}{2}(\mu_1 + \mu_2)$

and for the second comparison

$\psi_2 = \tfrac{1}{2}(\mu_3 + \mu_4) - \mu_5$

In each case, the value of the contrast is 0 when the corresponding null hypothesis is true.
Note that we have chosen to define the contrasts so that they will
be positive when the alternative of interest (what we expect) is
true. Whenever possible, this is a good idea because it makes results
easier to read.
A contrast expresses an effect in the population as a combination of population means. To estimate the contrast, form the corresponding sample contrast by using sample means in place of population means. Under the ANOVA assumptions, a sample contrast is a linear combination of independent Normal variables and, therefore, has a Normal distribution (page 49). We can obtain the standard error of a contrast by using the rules for variances. Inference is based on t statistics. Here are the details.
Because each sample mean $\bar{x}_i$ estimates the corresponding population mean $\mu_i$, the sample contrast $c = \sum a_i \bar{x}_i$ estimates the population contrast $\psi = \sum a_i \mu_i$. For the contrasts in Examples 12.19 and 12.20, the coefficients are

$(a_1, a_2, a_3, a_4, a_5) = (-\tfrac{1}{2}, -\tfrac{1}{2}, \tfrac{1}{2}, \tfrac{1}{2}, 0)$ for $\psi_1$

and

$(a_1, a_2, a_3, a_4, a_5) = (0, 0, \tfrac{1}{2}, \tfrac{1}{2}, -1)$ for $\psi_2$

where the subscripts 1, 2, 3, 4, and 5 correspond to the profiles listed in Example 12.17, respectively. In each case, the sum of the coefficients is 0.
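In compact form, the standard formulas for contrast inference, consistent with the definitions above, are

\[
\psi = \sum_i a_i\,\mu_i, \qquad c = \sum_i a_i\,\bar{x}_i, \qquad \mathrm{SE}_c = s_p \sqrt{\sum_i \frac{a_i^2}{n_i}},
\]
\[
t = \frac{c}{\mathrm{SE}_c} \ \ (\text{df} = \mathrm{DFE}), \qquad \text{confidence interval: } c \pm t^{*}\,\mathrm{SE}_c,
\]

where $s_p$ is the pooled estimate of the common standard deviation from the ANOVA and the t distribution has the error degrees of freedom DFE.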
We now look at inference for each of these contrasts in turn.
The sample contrast that estimates $\psi_1$ is

$c_1 = \tfrac{1}{2}(\bar{x}_3 + \bar{x}_4) - \tfrac{1}{2}(\bar{x}_1 + \bar{x}_2)$

with standard error

$\mathrm{SE}_{c_1} = s_p \sqrt{\dfrac{(1/2)^2}{21} + \dfrac{(1/2)^2}{21} + \dfrac{(1/2)^2}{21} + \dfrac{(1/2)^2}{21}}$

The t statistic for testing $H_{01}\colon \psi_1 = 0$ versus $H_{a1}\colon \psi_1 \ne 0$ is $t = c_1/\mathrm{SE}_{c_1}$, with the $\mathrm{DFE} = 105 - 5 = 100$ degrees of freedom of $s_p$. Because the two-sided P-value reported by the software (Figure 12.14) is 0.836, we do not have evidence of a difference in mean viewing time between the negative and positive profiles.
We use the same method for the second contrast.
The sample contrast that estimates $\psi_2$ is

$c_2 = \tfrac{1}{2}(\bar{x}_3 + \bar{x}_4) - \bar{x}_5$

with standard error

$\mathrm{SE}_{c_2} = s_p \sqrt{\dfrac{(1/2)^2}{21} + \dfrac{(1/2)^2}{21} + \dfrac{(-1)^2}{21}}$

The t statistic for assessing the significance of this contrast is $t = c_2/\mathrm{SE}_{c_2}$, again with 100 degrees of freedom. The P-value for the two-sided alternative is 0.0003. If we use Table D, we conclude that $P < 0.001$. There is strong evidence that the mean viewing time for the negative profiles differs from that for the neutral profile.
The size of the difference can be described with a confidence interval. To find the 95% confidence interval for $\psi_2$, we use $c_2 \pm t^{*}\,\mathrm{SE}_{c_2}$, where $t^{*} = 1.984$ is the critical value of the $t(100)$ distribution. The interval is (0.43, 1.39). Unfortunately, this interval is difficult to interpret because the units are square roots of minutes rather than minutes.
JMP output for the two contrasts is given in Figure 12.14. The results agree with the calculations that we performed in Examples 12.22 and 12.23 except for minor differences due to roundoff error in our calculations. Note that the output does not give the confidence interval that we calculated in Example 12.24. This is easily computed, however, from the contrast estimate and standard error provided in the output.
Figure 12.14 JMP output giving the contrast analysis for the Facebook profile study.
Some statistical software packages report the test statistics associated with contrasts as F statistics rather than t statistics. These F statistics are the squares of the t statistics described previously. As with much other statistical software output, P-values for significance tests are reported for the two-sided alternative.
If the software you are using gives P-values for the two-sided alternative and you are using the appropriate one-sided alternative, divide the reported P-value by 2, provided the sample contrast is in the direction of the one-sided alternative.
In our example, we argued that a one-sided alternative may be appropriate for the first contrast. The software reported the two-sided P-value as 0.836, so the one-sided P-value is at least 0.836/2 = 0.418. Either way, there is no evidence that participants spend more time viewing the negative profiles than the positive profiles.
Questions about population means are expressed as hypotheses about contrasts. A contrast should express a specific question that we have in mind when designing the study. Because the ANOVA F test answers a very general question, it is less powerful than tests for contrasts designed to answer specific questions.
When contrasts are formulated before seeing the data, inference about contrasts is valid whether or not the ANOVA null hypothesis of equal population means is rejected.
12.13 Defining a contrast. Refer to Example 12.18 (page 620). Suppose the researcher was also interested in comparing the viewing time between male and female profile pages. Specify the coefficients for this contrast.
12.14 Defining different coefficients. Refer to
Example 12.23 (page 628). Suppose we had selected the coefficients
In many studies, specific questions cannot be formulated in advance of the analysis. If the ANOVA F test rejects the null hypothesis of equal population means, we would like to know which pairs of means differ. Multiple-comparisons procedures address this question.
Let’s return once more to the Facebook study with five groups (page 624). We can make 10 comparisons between pairs of means, and we can write a t statistic for each of these pairs. For example, the statistic

$t_{12} = \dfrac{\bar{x}_1 - \bar{x}_2}{s_p \sqrt{\tfrac{1}{n_1} + \tfrac{1}{n_2}}}$

compares Profiles 1 and 2. The subscripts on t specify which groups are compared.
The t statistics for the other pairs are defined in the same way, with the appropriate sample means and sample sizes substituted.
These 10 t statistics are very similar to the pooled two-sample t statistic for comparing two population means. The difference is that we now have more than two populations, so each statistic uses the pooled estimator $s_p$ of the common standard deviation based on all five groups, with its $\mathrm{DFE} = 100$ degrees of freedom, rather than an estimate based only on the two groups being compared.
Because we do not have any specific ordering of the means in mind as an alternative to equality, we must use a two-sided approach to the problem of deciding which pairs of means are significantly different.
One obvious choice for the critical value is the usual two-sided $t^{*}$ for $\alpha = 0.05$ with the error degrees of freedom. Declaring two means different whenever $|t_{ij}|$ exceeds this value is called the least significant differences (LSD) method.
The LSD method has some undesirable properties, particularly if the number of means being compared is large. Suppose, for example, that there are 20 groups; then there are 190 pairs of means to compare. Even if all 20 population means are equal, each comparison has probability 0.05 of incorrectly declaring a difference, so we would expect about 190 × 0.05 = 9.5 false rejections. Because LSD fixes the probability of a false rejection for each single pair of means being compared, it does not control the overall probability of some false rejection among all pairs.
Other choices of the critical value control possible errors in other ways. We will discuss only one of these, called the Bonferroni method. Use of this method with $\alpha = 0.05$, for example, guarantees that the probability of any false rejection among all of the comparisons is no greater than 0.05.
We apply the Bonferroni method to the data from the Facebook study using an overall significance level of $\alpha = 0.05$. With 10 pairwise comparisons, each individual comparison is carried out at level $0.05/10 = 0.005$.
Of course, we prefer to use software for the calculations.
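As a rough sketch of how these pairwise comparisons could be computed in Python: the sample sizes, pooled standard deviation, and degrees of freedom below follow the study design described above, but the group means are placeholders because the actual means appear only in the software output.

```python
import itertools
import numpy as np
from scipy import stats

# Placeholder group means on the square-root scale (hypothetical values);
# 21 observations per group, pooled SD computed from the table's SDs.
xbar = np.array([1.5, 1.6, 2.0, 2.2, 1.2])     # hypothetical
n    = np.array([21, 21, 21, 21, 21])
sp, dfe = 0.91, 100                             # pooled SD (approx.), DFE = N - I

k = 10                                          # number of pairwise comparisons
alpha = 0.05
tstar = stats.t.ppf(1 - alpha / (2 * k), dfe)   # Bonferroni critical value

for i, j in itertools.combinations(range(5), 2):
    tij = (xbar[i] - xbar[j]) / (sp * np.sqrt(1/n[i] + 1/n[j]))
    flag = "*" if abs(tij) > tstar else " "
    print(f"Profiles {i+1} vs {j+1}: t = {tij:6.2f} {flag}")
```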
The output generated by SPSS for comparisons using the Bonferroni method appears in Figure 12.15. Here, all 10 comparisons are reported. In fact, each comparison is given twice. The software uses an asterisk to indicate that the difference in a pair of means is statistically significant. These results agree with our conclusions for the three comparisons in Example 12.26.
Figure 12.15 SPSS output giving the multiple-comparisons analysis for the Facebook profile study, Example 12.27.
SPSS does not provide the t statistic but rather a Bonferroni-adjusted P-value for each comparison under the heading “Sig.” The Bonferroni-adjusted P-value is obtained by multiplying the unadjusted P-value by the number of comparisons; when this product is greater than 1, the adjusted P-value is reported as 1.000. In the “Sig.” column, only three comparisons have a P-value smaller than the overall $\alpha = 0.05$: those comparing the neutral profile (Profile 5) with Profiles 1, 2, and 4.
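The adjustment SPSS applies can be mimicked directly. This small sketch, with a hypothetical t statistic, shows the computation just described:

```python
from scipy import stats

def bonferroni_adjust(t, dfe, k):
    """Multiply the unadjusted two-sided P-value by the number of
    comparisons k and cap the result at 1 (SPSS-style adjustment)."""
    p = 2 * stats.t.sf(abs(t), dfe)
    return min(k * p, 1.0)

# Hypothetical t statistic, DFE = 100, 10 pairwise comparisons
print(bonferroni_adjust(3.76, 100, 10))
```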
When there are a large number of groups, it is often difficult to concisely describe the numerous results of multiple comparisons. Instead, researchers often list the means and use letters to label the pairs that are not found to be statistically different.
The following table lists the groups of the Facebook study, with their sample sizes, means, and standard deviations. The groups are ordered by their sample means, and groups whose means are not found to be significantly different share a letter label.
| Profile | n | $\bar{x}$ | s | Label |
|---|---|---|---|---|
| 4. Negative male | 21 |  | 1.041 | A |
| 2. Positive male | 21 |  | 0.892 | A |
| 1. Positive female | 21 |  | 0.850 | A |
| 3. Negative female | 21 |  | 0.921 | A, B |
| 5. Neutral | 21 |  | 0.834 | B |
Here, the mean for Profile 5 is significantly different from the means for Profiles 1, 2, and 4 because those profiles do not share a label with Profile 5. The mean for Profile 5 is not found to be significantly different from the mean for Profile 3 because they share the label “B.” To complicate things, the means for Profiles 1, 2, and 4 are also not found to be significantly different from the mean for Profile 3, as they share the label “A.”
The conclusions from the table in Example 12.28 appear to be illogical. If the mean for Profile 3 is not distinguishable from the mean for Profile 5, and it is also not distinguishable from the means for Profiles 1, 2, and 4, how can the mean for Profile 5 be different from the means for Profiles 1, 2, and 4?
This apparent contradiction points out the nature of the conclusions of statistical tests of significance. In the multiple-comparisons setting, these tests ask, “Do we have adequate evidence to distinguish two means?” In describing the results, we should talk about detecting, or failing to detect, a difference between two groups rather than about two means being equal. We also must remember that failing to find strong enough evidence that two means differ doesn’t say that they are equal. Thus, it is not illogical to conclude that we have sufficient evidence to distinguish the mean for Profile 5 from the means for Profiles 1, 2, and 4, but not enough evidence to distinguish the mean for Profile 3 from any of the other means.
One way to deal with the difficulties of interpretation is to give confidence intervals for the differences. The intervals remind us that the differences are not known exactly. We also want to give simultaneous confidence intervals—that is, intervals for all differences among the population means at once. Again, we must face the problem that there are many competing procedures—in this case, many methods of obtaining simultaneous intervals.
The confidence intervals generated by a particular choice of critical value are closely related to the multiple-comparisons results obtained with that same choice: a pair of means is declared significantly different exactly when the corresponding interval does not contain 0.
The SPSS output for the Bonferroni method given in Figure 12.15 also includes simultaneous 95% confidence intervals in the last two columns. We can see, for example, that the Bonferroni interval for each pair of means flagged as significantly different does not contain 0, whereas the intervals for the other pairs do.
12.15 Why no additional analyses?
Explain why it is unnecessary to further analyze the data using
a multiple-comparisons method when
12.16 Growth of Douglas fir seedlings. An experiment was conducted to compare the growth of Douglas fir seedlings under three different levels of vegetation control (0%, 50%, and 100%). Sixteen seedlings were randomized to each level of control. The resulting sample means for stem volume were 58, 73, and 105 cubic centimeters, respectively.
What are the coefficients for testing this contrast?
Perform the test and report the test statistic, degrees of freedom, and P-value. Do the data provide evidence to support this hypothesis?
The power of a statistical test is a measure of the test’s ability to detect deviations from the null hypothesis. In Chapter 7, we described the use of power calculations to ensure adequate sample size for inference about means in one- and two-population studies. In this section, we extend these methods to any number of populations.
Because the one-way ANOVA F test is a generalization of the two-sample t test, it should not be surprising that the procedure for calculating the power of the F test is quite similar. In both cases, the following four study specifications are needed to compute power: the significance level α, the group sample sizes, a guess at the common population standard deviation σ, and the alternative of interest (the set of population means we want to be able to detect).
The difference comes in the specific calculations. Instead of using a t and a noncentral t distribution to compute the probability of rejecting $H_0$, we use an F and a noncentral F distribution.
The last three study specifications in the prior list determine the appropriate noncentral F distribution. Most software assumes a constant sample size n for the group sizes. For the alternative, some software doesn’t request the group means but rather the smallest difference between means that is judged practically important. Here is an example using two software packages that specify the alternative in these two different ways.
Suppose that a study on reading comprehension for three different teaching methods has 10 students in each group. How likely is this study to detect differences in means? A previous study, using the same three teaching methods but performed in a different setting, found sample means of 40, 48, and 42, and the pooled standard deviation was 7. We’ll use these values to compute the power for this new study.
Figure 12.16 shows the power calculation output from JMP and Minitab. In both cases, we use $\alpha = 0.05$. For JMP, we enter the alternative group means (40, 48, and 42) and the total sample size N = 30. For Minitab, we enter the smallest difference between means judged important (48 − 40 = 8), the common standard deviation 7, and the sample size per group.
Figure 12.16 JMP and Minitab power calculation outputs, Example 12.30.
As in this case, the power is usually lower when only specifying an important difference between means. This is because the other population means are not specified, and so the software considers a worst-case scenario.
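As a check on the software, the same power calculation can be sketched in Python with scipy's noncentral F distribution. The noncentrality parameter is the standard $\lambda = n \sum_i (\mu_i - \bar{\mu})^2 / \sigma^2$; the specific numbers below are those of Example 12.30.

```python
from scipy import stats

# Power of the one-way ANOVA F test for Example 12.30:
# 3 groups, n = 10 per group, alternative means 40, 48, 42, sigma = 7.
means, n, sigma, alpha = [40, 48, 42], 10, 7, 0.05

I = len(means)
grand = sum(means) / I
# Noncentrality parameter: lambda = n * sum((mu_i - mu_bar)^2) / sigma^2
lam = n * sum((m - grand) ** 2 for m in means) / sigma ** 2

dfg, dfe = I - 1, I * (n - 1)                  # 2 and 27
fstar = stats.f.ppf(1 - alpha, dfg, dfe)       # about 3.35
power = stats.ncf.sf(fstar, dfg, dfe, lam)     # roughly 0.6
print(f"lambda = {lam:.2f}, F* = {fstar:.2f}, power = {power:.2f}")
```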
If the assumed values of the $\mu_i$ describe differences among the group means that the experimenters want to detect, a power of about 0.6 indicates that more than 10 subjects per group are needed.
To decide on an appropriate sample size for the experiment described in the previous example, we repeat the power calculation for different values of n, the number of subjects in each group. Here are the results:
| N | DFG | DFE | F* | Power |
|---|---|---|---|---|
| 30 | 2 | 27 | 3.35 | 0.61 |
| 36 | 2 | 33 | 3.28 | 0.70 |
| 45 | 2 | 42 | 3.22 | 0.81 |
| 57 | 2 | 54 | 3.17 | 0.90 |
| 99 | 2 | 96 | 3.09 | 0.99 |

(Here, N is the total number of subjects; with three groups, the number per group is n = N/3.)
When you have a level of power in mind, software will allow you to directly solve for the necessary sample size rather than trying different values of n, as we did in this example. Instead of leaving the power blank, you’d enter the desired power as an input (usually 80% or 90%) and leave the sample size blank. Try this out using JMP for the setting of Example 12.31 and verify that the experimenters need a total sample size of 57 (19 subjects per group) to achieve 90% power.
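If you prefer code to trial and error, a short loop over n, a sketch under the same assumptions as the previous code block, finds the smallest per-group sample size that meets a target power:

```python
from scipy import stats

# Smallest per-group sample size n giving at least 90% power for the
# setting of Example 12.31 (3 groups, means 40, 48, 42, sigma = 7).
means, sigma, alpha, target = [40, 48, 42], 7, 0.05, 0.90
grand = sum(means) / len(means)
effect = sum((m - grand) ** 2 for m in means)      # sum of squared deviations

n = 2
while True:
    dfg, dfe = len(means) - 1, len(means) * (n - 1)
    lam = n * effect / sigma ** 2
    power = stats.ncf.sf(stats.f.ppf(1 - alpha, dfg, dfe), dfg, dfe, lam)
    if power >= target:
        break
    n += 1

# Expect n in the neighborhood of 19 per group (N about 57),
# consistent with the table above.
print(n, round(power, 3))
```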
A study with 90% power means the experimenters have a 90% chance of rejecting $H_0$, and thereby detecting the assumed differences among the means, when those assumed values are the true population means.
There is, however, a cost associated with this increased power. The total sample size is almost doubled, which could mean a doubling of the cost of the study. To trade off cost with the risk of not detecting the difference in the means, researchers generally shoot for power in the 80% to 90% range. In most real-life situations, the additional cost of increasing the sample size so that the power is very close to 100% cannot be justified.
12.17 Understanding power calculations.
Refer to
Example 12.30.
Suppose that the researcher decided to use
12.18 Understanding power calculations (continued).
If all the group means are equal (
The ANOVA F test does not say which group means differ. It is, therefore, usual to add comparisons among the means to one-way ANOVA.
Specific questions formulated before examination of the data can be expressed as contrasts. Tests and confidence intervals for contrasts provide answers to these questions.
If no specific questions are formulated before examination of the data and the null hypothesis of equality of population means is rejected, multiple-comparisons procedures are used to assess the statistical significance of the differences between pairs of means.
The least significant differences (LSD) method controls the probability of a false rejection for each comparison. The Bonferroni method controls the overall probability of some false rejections among all comparisons.
The power of the one-way ANOVA F test depends upon the significance level, the group sample sizes, the common population standard deviation, and the choice of alternative. Software can do the power calculations if these study-specific factors are provided.
12.20 Define a contrast. An ANOVA was run with six groups. Give the coefficients for the contrast that compares the average of the means of the first four groups with the mean of the last two groups.
12.21 Find the standard error.
Refer to the previous exercise. Suppose that there are 10
observations in each group and that
12.22 Is the contrast significant? Refer to the previous exercise. Suppose that the average of the first four groups minus the average of the last two groups is 2.6. State an appropriate null hypothesis for this comparison and find the test statistic with its degrees of freedom. Can you draw a conclusion? Or do you need to know the alternative hypothesis?
12.23 Give the confidence interval. Refer to the previous exercise. Give a 95% confidence interval for the difference between the average of the means of the first four groups and the average mean of the last two groups.
12.24 Background music contrast. Refer to
Example 12.3
(page 603).
The researchers hypothesize that listening to any background
music (with or without lyrics) will, on average, result in fewer
completed math problems than working in silence. Test this
hypothesis with a contrast. Make sure to specify the null and
alternative hypotheses, the sample contrast, and its standard
error. Also report the t statistic, along with its
degrees of freedom and P-value.
12.25 College dining facilities. University and college food service operations have been trying to keep up with the growing expectations of consumers with regard to the overall campus dining experience. Because customer satisfaction has been shown to be associated with repeat patronage and new customers gained through word of mouth, a public university in the Midwest took a sample of patrons from its eating establishments and asked them about their overall dining satisfaction.11 The following table summarizes the results for three groups of patrons:
| Category | $\bar{x}$ | n | s |
|---|---|---|---|
| Student—meal plan | 3.44 | 489 | 0.804 |
| Faculty—meal plan | 4.04 | 69 | 0.824 |
| Student—no meal plan | 3.47 | 212 | 0.657 |
(a) Is it reasonable to use a pooled standard deviation for these data? Why or why not? If yes, compute it.
(b) The ANOVA F statistic was reported as 17.66. Give the degrees of freedom and either an approximate (from a table) or an exact (from software) P-value. Sketch a picture that describes this calculation.
(c) Prior to performing this survey, food service operations thought that satisfaction among faculty on the meal plan would be higher than satisfaction among students on the meal plan. Explain why a contrast should be used to examine this rather than a multiple-comparisons procedure.
(d) Use the results in the table to test this contrast. Make sure to specify the null and alternative hypotheses, test statistic, and P-value.
(e) Suppose food service operations had been interested in comparing faculty on the meal plan to students overall. Repeat part (d) using this question of interest.
12.26 Writing contrasts. You’ve been asked to help some administrators analyze survey data on textbook expenditures collected at a large public university. Let $\mu_1$, $\mu_2$, $\mu_3$, and $\mu_4$ represent the mean textbook expenditures of first-year, second-year, third-year, and fourth-year students, respectively.
Because first- and second-year students take lower-level courses, which often involve large introductory textbooks, the administrators want to compare the average textbook expenditures of the first-year and second-year students with those of the third- and fourth-year students. Write a contrast that expresses this comparison.
Write a contrast for comparing the mean of the first-year students with that of the second-year students.
Write a contrast for comparing the mean of the third-year students with that of the fourth-year students.
12.27 Writing contrasts (continued). Return to the eye study described in Example 12.16 (page 618). Let $\mu_1, \mu_2, \ldots$ denote the population mean scores for the groups in that example, in the order listed there.
Because a majority of the population in this study are Hispanic (eye color predominantly brown), we want to compare the average score of the brown eyes with the average of the other two eye colors. Write a contrast that expresses this comparison.
Write a contrast to compare the average score when the model is looking at the viewer versus the average score when the model is looking down.
12.28 Analyzing contrasts. Answer the following
questions for the two contrasts that you defined in the previous
exercise.
For each contrast, give $H_0$ and an appropriate $H_a$.
Find the values of the corresponding sample contrasts $c_1$ and $c_2$.
Calculate the standard errors of $c_1$ and $c_2$.
Give the test statistics and approximate P-values for the two significance tests. What do you conclude?
Compute 95% confidence intervals for the two contrasts.
12.29 Two contrasts of interest for the stimulant study. Refer to Exercise 12.15 (page 623). There are two comparisons of interest to the experimenter. They are (1) placebo versus the average of the two low-dose treatments and (2) the difference between High A and Low A versus the difference between High B and Low B.
Express each contrast in terms of the means $\mu_i$ of the treatment groups.
Give estimates with standard errors for each of the contrasts.
Perform the significance tests for the contrasts. Summarize the results of your tests.
12.30 The Bonferroni method. For each of the following settings, state the α level to be used for each test.
There are 21 pairs of means
There are 36 pairs of means
There are
12.31 Use of a multiple-comparisons procedure.
A friend performed a one-way ANOVA at the
12.32 Multitasking with technology in the classroom. Laptops and other digital technologies with wireless access to the Internet are becoming more and more common in the classroom. While numerous studies have shown that these technologies can be used effectively as part of teaching, there is concern that these technologies can also distract learners if used for off-task behaviors.
In one study that looked at the effects of off-task multitasking with digital technologies in the classroom, a total of 145 undergraduates were randomly assigned to one of seven conditions.12 Each condition involved performing a task simultaneously during lecture. The study consisted of three 20-minute lectures, each followed by a 15-item quiz. The following table summarizes the conditions and quiz results (mean proportion correct):
| Condition | n | Lecture 1 | Lecture 2 | Lecture 3 |
|---|---|---|---|---|
| Texting | 21 | 0.57 | 0.75 | 0.56 |
| Emailing | 20 | 0.52 | 0.69 | 0.50 |
| Using Facebook | 20 | 0.50 | 0.68 | 0.43 |
| MSN messaging | 21 | 0.48 | 0.71 | 0.42 |
| Natural use control | 21 | 0.50 | 0.78 | 0.58 |
| Word-processing control | 21 | 0.55 | 0.75 | 0.57 |
| Paper-and-pencil control | 21 | 0.60 | 0.74 | 0.53 |
(a) For this analysis, let’s consider the average of the three quizzes as the response. Compute this mean for each condition.
(b) The analysis of these average scores results in
(c) Using the means from part (a) and the Bonferroni method, determine which pairs of means differ significantly at the 0.05 significance level. (Hint: There are 21 pairwise comparisons, so the critical t-value is 3.095. Also, it is best to order the means from smallest to largest to help with pairwise comparisons.)
(d) Summarize your results from parts (b) and (c) in a short report.
12.33 Contrasts for multitasking. Refer to the previous exercise. Let $\mu_1, \mu_2, \ldots, \mu_7$ denote the population mean proportions correct for the seven conditions, in the order listed in the table of the previous exercise.
The researchers hypothesized that the average score for the off-task behaviors would be lower than that for the paper-and-pencil control condition. Write a contrast that expresses this comparison.
For this contrast, give $H_0$ and an appropriate $H_a$.
Calculate the test statistic and approximate P-value for the significance test. What do you conclude?
12.34 Power calculations for planning a study.
You are planning a new eye gaze study for a different university
than that studied in
Example 12.16
(page 618).
From
Figure 12.9
(page 618),
the pooled standard error is 1.68. To be a little conservative,
use
Pick several values for n (the number of students that you will select for each group) and calculate the power of the ANOVA F test for each of your choices.
Plot the power versus the sample size. Describe the general shape of the plot.
What choice of n would you choose for your study? Give reasons for your answer.
12.35 Power for a different alternative.
Refer to the previous exercise. Suppose we increase