6.3 Use and Abuse of Tests

Carrying out a test of significance is often quite simple, especially when a computer supplies the P-value. Using tests wisely is not so simple. Each test is valid only in certain circumstances, and properly produced data are particularly important.

The z test, for example, should bear the same warning label that was attached in Section 6.1 to the corresponding confidence interval (page 341). Similar warnings accompany the other tests that we will learn. There are, however, additional caveats that concern significance tests more than confidence intervals, enough to warrant this separate section. Some hesitation about the unthinking use of significance tests is a sign of statistical maturity.

The reasoning of significance tests has appealed to researchers in many fields, and tests are widely used to report research results. In this setting Ha is a “research hypothesis” asserting that some effect or difference is present. The null hypothesis H0 says that there is no effect or no difference. A small P-value represents good evidence that the research hypothesis is true. Here are some comments on the use of significance tests, with emphasis on their use in reporting scientific research.

Choosing a level of significance

The intention of a significance test is to give a clear statement of the degree of evidence provided by the sample against the null hypothesis. The P-value does this. It is common practice, however, to describe the results in terms of statistical significance (i.e., whether P ≤ α). Unfortunately, this yes/no determination of statistical significance oversimplifies the findings. This is because there is no sharp border between “significant” and “not significant”; there is only increasingly strong evidence as the P-value decreases. Including the P-value with a description of the detected effect allows for a much clearer conclusion.

Example 6.21 Information provided by the P-value.

Suppose that the test statistic for a two-sided significance test for a population mean is z=1.95. From Table A we can calculate the P-value. It is

P = 2[1 − P(Z ≤ 1.95)] = 2(1 − 0.9744) = 0.0512

We have failed to meet the standard of evidence for α=0.05. However, with the information provided by the P-value, we can see that the result just barely missed our standard. If the effect in question is interesting and potentially important, we might want to design another study with a larger sample to investigate it further.

Here is another example where the P-value provides useful information beyond that provided by the statement that we reject or fail to reject the null hypothesis.

Example 6.22 More on information provided by the P-value.

We have a test statistic of z=4.66 for a two-sided significance test on a population mean. Software tells us that the P-value is 0.000003. This means that there are 3 chances in 1,000,000 of observing a sample mean this far or farther away from the null hypothesized value of μ0. This kind of event is virtually impossible if the null hypothesis is true. There is no ambiguity in the result; we can clearly reject the null hypothesis.
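
Both of these P-values are easy to reproduce in software. Here is a minimal sketch in Python, assuming SciPy is available; it computes the two-sided P-value for any z test statistic.

    # Two-sided P-value for a z test statistic.
    # A minimal sketch; assumes SciPy is installed.
    from scipy.stats import norm

    def two_sided_p(z):
        # P = 2 * P(Z >= |z|) for a standard Normal Z
        return 2 * norm.sf(abs(z))

    print(two_sided_p(1.95))  # about 0.0512, as in Example 6.21
    print(two_sided_p(4.66))  # about 0.000003, as in Example 6.22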

One reason for the common use of α=0.05 is the great influence of Sir R. A. Fisher, the inventor of formal statistical methods for analyzing experimental data. Here is his opinion on choosing a level of significance: “A scientific fact should be regarded as experimentally established only if a properly designed experiment rarely fails to give this level of significance.”22

Recently, certain scientific communities, which have historically used α=0.05, have proposed lowering the threshold to α=0.005.23 This is in response to the lack of reproducibility of study results. By lowering the threshold, it is argued that an initial discovery has a greater chance of being confirmed in follow-up studies.

These arguments suggest that the choice of α should reflect the goals and costs of the study. It also should involve relevant subject-matter knowledge. Regardless of its choice, however, it’s important to remember that statistical significance can be determined at any threshold, provided that the P-value is given. To assess practical (scientific) significance, a description of the effect is needed. This is discussed next.

What statistical significance does not mean

When a null hypothesis (“no effect” or “no difference”) can be rejected at the usual level α=0.05, there is good evidence that an effect is present. That effect, however, can be extremely small. When large samples are available, even tiny deviations from the null hypothesis will be statistically significant. In the era of big data, this is particularly important to keep in mind.

Example 6.23 It’s significant, but is it important?

Suppose that we are testing the null hypothesis of no correlation between two variables. With 400 observations, an observed correlation of only r=0.1 is significant evidence at the α=0.05 level that the correlation in the population is not zero. Figure 6.15 shows an example of 400 (x, y) pairs that have an observed correlation of 0.10. Statistical significance here does not mean that there is a strong association, only that there is strong evidence of some association. The proportion of the variability in one of the variables explained by the other is r²=0.01, or 1%.

Figure 6.15 Scatterplot of n=400 observations with an observed correlation of 0.10, Example 6.23. There is not a strong association between the two variables even though there is significant evidence (P<0.05) that the population correlation is not zero.
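
The claim in Example 6.23 is easy to verify. The usual test statistic for a sample correlation r based on n pairs is t = r√(n − 2)/√(1 − r²), referred to a t(n − 2) distribution. A minimal sketch, assuming SciPy:

    # Two-sided significance test for a sample correlation r with n pairs.
    # Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2), with n - 2 df.
    from math import sqrt
    from scipy.stats import t as t_dist

    def corr_p_value(r, n):
        t_stat = r * sqrt(n - 2) / sqrt(1 - r**2)
        return 2 * t_dist.sf(abs(t_stat), df=n - 2)

    print(corr_p_value(0.1, 400))  # about 0.046: significant at the 0.05 level
    print(0.1**2)                  # r^2 = 0.01: only 1% of variability explained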

For practical purposes, we might well decide to ignore this association. Statistical significance is not the same as practical significance. Statistical significance rarely tells us about the importance of the experimental results. This depends on subject-matter knowledge and the context of the experiment.

The remedy for attaching too much importance to statistical significance is to pay attention to the actual experimental results as well as to the P-value. Plot your data and examine them carefully. Beware of outliers. The user of statistics who feeds the data to a computer without exploratory analysis will often be embarrassed. It is usually wise to give a confidence interval for the parameter in which you are interested. Confidence intervals are not used as often as they should be, while tests of significance are overused.

Check-in
  1. 6.26 Is it significant? More than 200,000 people worldwide take the GMAT examination each year when they apply for MBA programs. Their scores vary Normally with mean μ=540 and standard deviation σ=100. One hundred students go through a rigorous training program designed to raise their GMAT scores. Test the following hypotheses about the training program

    H0:μ=540
    Ha:μ>540

    in each of the following situations.

    1. The students’ average score is x¯=556.4. Is this result significant at the 5% level?

    2. Now suppose that the average score is x¯=556.5. Is this result significant at the 5% level?

    3. Explain how you would reconcile this difference in significance, especially if any increase greater than 15 points is considered a success.
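
If you want to check your hand calculations with software, here is a minimal sketch of the one-sided z test computation, assuming SciPy is available; it returns the z statistic and the one-sided P-value.

    # One-sample z test of H0: mu = mu0 against Ha: mu > mu0.
    # A sketch for checking work done by hand; assumes SciPy.
    from math import sqrt
    from scipy.stats import norm

    def z_test_upper(xbar, mu0, sigma, n):
        z = (xbar - mu0) / (sigma / sqrt(n))
        return z, norm.sf(z)  # one-sided P-value, P(Z >= z)

    print(z_test_upper(556.4, 540, 100, 100))  # part 1
    print(z_test_upper(556.5, 540, 100, 100))  # part 2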

Don’t ignore lack of significance

There is a tendency to conclude that there is no effect whenever a P-value fails to attain the usual 5% standard. A provocative editorial in the British Medical Journal titled “Absence of Evidence Is Not Evidence of Absence” deals with this issue.24 Here is one of the examples this editorial cites.

Example 6.24 Interventions to reduce HIV-1 transmission.

A randomized trial of interventions for reducing transmission of HIV-1 reported an incidence rate ratio of 1.00, meaning that the intervention group and the control group had the same rate of HIV-1 infection. The 95% confidence interval was reported as 0.63 to 1.58.25 The editorial notes that a summary of these results that says the intervention has no effect on HIV-1 infection is misleading. The confidence interval indicates that the intervention may be capable of achieving a 37% decrease in infection; it might also be harmful and produce a 58% increase in infection. Clearly, more data are needed to distinguish between these possibilities.

The situation can be worse. Research in some fields has rarely been published unless significance at the 0.05 level was attained.

Example 6.25 Journal survey of reported significance results.

A survey of four journals published by the American Psychological Association showed that of 294 articles using statistical tests, only eight reported results that did not attain the 5% significance level.26 It is very unlikely that these were the only eight studies of scientific merit that did not attain significance at the 0.05 level. Manuscripts describing other studies were likely rejected because of a lack of statistical significance or never submitted in the first place due to the expectation of rejection.

In some areas of research, small effects that are detectable only with large sample sizes can be of great practical significance. Data accumulated from a large number of patients taking a new drug may be needed before we can conclude that there are life-threatening consequences for a small number of people.

On the other hand, sometimes a meaningful result is not found to be significant.

Example 6.26 A meaningful but statistically insignificant result.

A sample of size 10 gave a correlation of r=0.5 between two variables. The P-value is 0.102 for a two-sided significance test. In many situations, a correlation this large would be interesting and worthy of additional study. When it takes a lot of effort (say, in terms of time or money) to obtain samples, researchers often use small studies like these as pilot projects to gain interest from various funding sources. With financial support, a larger, more powerful study can then be run.
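
For reference, the P-value quoted in this example matches the large-sample Normal approximation to the correlation test statistic z = r√(n − 2)/√(1 − r²); an exact t test with n − 2 = 8 degrees of freedom gives a somewhat larger value, about 0.14. A minimal sketch, assuming SciPy:

    # Normal approximation to the correlation test of Example 6.26.
    from math import sqrt
    from scipy.stats import norm

    r, n = 0.5, 10
    z = r * sqrt(n - 2) / sqrt(1 - r**2)  # about 1.63
    print(2 * norm.sf(z))                 # about 0.102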

Another important aspect of planning a study is to verify that the test you plan to use has a high probability of detecting an effect of the size you hope to find. This probability is the power of the test. Power calculations are discussed in Section 7.3 and elsewhere for particular data analysis procedures.
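
To make the idea of power concrete, here is a sketch of a power calculation for the one-sided z test, assuming SciPy and reusing the GMAT setting of Check-in 6.26 with a hypothetical alternative mean μ = 560.

    # Power of the level-alpha one-sided z test against a specific alternative.
    # Hypothetical numbers: GMAT setting, alternative mu = 560.
    from math import sqrt
    from scipy.stats import norm

    def power_upper_z(mu0, mu_alt, sigma, n, alpha=0.05):
        se = sigma / sqrt(n)
        cutoff = mu0 + norm.isf(alpha) * se  # reject H0 when xbar exceeds this
        return norm.sf((cutoff - mu_alt) / se)

    print(power_upper_z(540, 560, 100, 100))  # about 0.64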

Statistical inference is not valid for all sets of data

In Chapter 3, we learned that badly designed surveys or experiments often produce invalid results. Formal statistical inference cannot correct basic flaws in the design.

Example 6.27 English vocabulary and studying a foreign language.

There is no doubt that there is a significant difference in English vocabulary scores between high school seniors who have studied a foreign language and those who have not. But because the effect of actually studying a language is confounded with the differences between students who choose language study and those who do not, this statistical significance is hard to interpret. The most plausible explanation is that students who were already good at English chose to study another language. A randomized comparative experiment would isolate the actual effect of language study and so make significance meaningful. Do you think it is ethical to do such a study?

Tests of significance and confidence intervals are based on the laws of probability. Randomization in sampling or experimentation ensures that these laws apply. But we must often analyze data that do not arise from randomized samples or experiments. To apply statistical inference to such data, we must have confidence in a probability model for the data. The diameters of successive holes bored in auto engine blocks during production, for example, may behave like independent observations from a Normal distribution. We can check this probability model by examining the data. If the Normal distribution model appears approximately correct, we can apply the methods of this chapter to do inference about the process mean diameter μ.

Check-in
  1. 6.27 Home security systems. A recent TV advertisement for home security systems said that homes without an alarm system are three times more likely to be broken into. Suppose that this conclusion was obtained by examining an SRS of police records of break-ins and determining whether the percent of homes with alarm systems was significantly smaller than 50%. Explain why the significance of this study is suspect and propose an alternative study that would help clarify the importance of an alarm system.

Beware of searching for significance

Statistical significance is an outcome much desired by researchers. It means (or ought to mean) that you have found an effect that you were looking for. The reasoning behind statistical significance works well if you decide what effect you are seeking, design an experiment or sample to search for it, and use a test of significance to weigh the evidence you get. But because a successful search for a new scientific phenomenon often ends with statistical significance, it is all too tempting to make significance itself the object of the search. There are several ways to do this, none of them acceptable in polite scientific society.

Example 6.28 Genomics studies.

In genomics experiments, it is common to assess the differences in expression for tens of thousands of genes. If each of these genes were examined separately and statistical significance declared for all that had P-values that pass the 0.05 standard, we would have quite a mess. In the absence of any real biological effects, we would expect that, by chance alone, approximately 5% of these tests would show statistical significance. Much research in genomics is directed toward appropriate ways to deal with this situation.27
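
A small simulation makes this concrete. The sketch below, assuming NumPy and SciPy are available, runs 10,000 z tests on samples generated with every null hypothesis true and counts how many reach P < 0.05 by chance alone.

    # Simulate many significance tests when all null hypotheses are true.
    # Assumes NumPy and SciPy; counts false positives at alpha = 0.05.
    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(0)
    n_tests, n_obs = 10_000, 50

    # Each row is a sample of 50 from N(0, 1), so H0: mu = 0 is true for all.
    samples = rng.normal(0.0, 1.0, size=(n_tests, n_obs))
    z = samples.mean(axis=1) / (1.0 / np.sqrt(n_obs))
    p = 2 * norm.sf(np.abs(z))

    print((p < 0.05).mean())  # close to 0.05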

We do not mean that searching data for suggestive patterns is not proper scientific work. It certainly is. Many important discoveries have been made by accident rather than by design. Exploratory analysis of data is an essential part of statistics. We do mean that the usual reasoning of statistical inference does not apply when the search for a pattern is successful. You cannot legitimately test a hypothesis on the same data that first suggested that hypothesis. The remedy is clear. Once you have a hypothesis, design a study to search specifically for the effect you now think is there. If the result of this study is statistically significant, you have real evidence.

Section 6.3 SUMMARY

  • P-values are more informative than the reject-or-not result of a level α test. Beware of placing too much weight on traditional values of α, such as α=0.05.

  • Very small effects can be highly significant (small P), especially when a test is based on a large sample. A statistically significant effect need not have practical significance. Always plot the data to display the effect you are seeking, and use confidence intervals to estimate the actual values of parameters.

  • Lack of significance does not imply that H0 is true, especially when the test has a low probability of detecting an effect.

  • Significance tests are not always valid. Faulty data collection, outliers in the data, and testing a hypothesis on the same data that suggested the hypothesis can invalidate a test.

  • Many tests run at once will probably produce some significant results by chance alone, even if all the null hypotheses are true.

Section 6.3 EXERCISES

  1. 6.60 What other information is needed? An observational study that involved n=27,391 subjects concluded that those who were frequent thumb-suckers as children lived longer than those who were not frequent thumb-suckers (P=0.00035). What other information from this study would you like to know before forming an opinion of this result?

  2. 6.61 What do you know? A research report described two results that both achieved statistical significance at the 5% level. The P-value for the first is 0.048; for the second it is 0.0002. Do the P-values add any useful information beyond that conveyed by the statement that both results are statistically significant? Write a short paragraph explaining your views on this question.

  3. 6.62 Selective publication based on results. In addition to statistical significance, selective publication can also be due to the observed outcome. A review of 74 studies of antidepressant agents found 38 studies with positive results and 36 studies with negative or questionable results. All but 1 of the 38 positive studies were published. Of the remaining 36, 22 were not published, and 11 were published in such a way as to convey a positive outcome.28 Describe how such selective reporting can have adverse consequences on health care.

  4. 6.63 What a test of significance can answer. Explain whether a test of significance can answer each of the following questions.

    1. Is the sample or experiment properly designed?

    2. Is the observed effect compatible with the null hypothesis?

    3. Is the observed effect important?

  5. 6.64 Vitamin C and colds. In a study to investigate whether vitamin C prevents colds, 400 subjects are assigned at random to one of two groups. The experimental group takes a vitamin C tablet daily, while the control group takes a placebo. At the end of the experiment, the researchers calculate the difference between the percents of subjects in the two groups who were free of colds. This difference is statistically significant (P=0.03) in favor of the vitamin C group. Can we conclude that vitamin C has a strong effect in preventing colds? Explain your answer.

  6. 6.65 How far do rich parents take us? How much education children get is strongly associated with the wealth and social status of their parents, termed “socioeconomic status,” or SES. The SES of parents, however, has little influence on whether children who have graduated from college continue their education. One study looked at whether college graduates took the graduate admissions tests for business, law, and other graduate programs. The effects of the parents’ SES on taking the LSAT test for law school were “both statistically insignificant and small.”

    1. What does “statistically insignificant” mean?

    2. Why is it important that the effects were small in size as well as statistically insignificant?

  7. 6.66 Do you agree? State whether or not you agree with each of the following statements and provide a short summary of the reasons for your answers.

    1. If the P-value is larger than 0.05, the null hypothesis is true.

    2. Practical significance is not the same as statistical significance.

    3. We can perform a statistical analysis using any set of data.

    4. If you find an interesting pattern in a set of data, it is appropriate to then use a significance test to determine its significance.

    5. It’s always better to use a significance level of α=0.05 than to use α=0.01 because it is easier to find statistical significance.

  8. 6.67 Practical significance and sample size. Every user of statistics should understand the distinction between statistical significance and practical importance. A sufficiently large sample will declare very small effects statistically significant. Consider the study of elite female Canadian athletes in Exercise 6.44 (page 366). Female athletes were consuming an average of 2403.7 kcal/d with a standard deviation of 880 kcal/d. Suppose that a nutritionist is brought in to implement a new health program for these athletes. This program should increase mean caloric intake but not change the standard deviation. Given the standard deviation and how calorie deficient these athletes are, a change in the mean of 50 kcal/d to 2453.7 is of little importance. However, with a large enough sample, this change can be significant. To see this, calculate the P-value for the test of

    H0:μ=2403.7
    Ha:μ>2403.7

    in each of the following situations:

    1. A sample of 100 athletes; their average caloric intake is x¯=2453.7.

    2. A sample of 500 athletes; their average caloric intake is x¯=2453.7.

    3. A sample of 2500 athletes; their average caloric intake is x¯=2453.7.

  9. 6.68 Statistical versus practical significance. A study with 7500 subjects reported a result that was statistically significant at the 5% level. Explain why this result might not be particularly important.

  10. 6.69 More on statistical versus practical significance. A study with 14 subjects reported a result that failed to achieve statistical significance at the 5% level. The P-value was 0.051. Write a short summary of how you would interpret these findings.

  11. 6.70 Find journal articles. Find two journal articles that report results with statistical analyses. For each article, summarize how the results are reported and write a critique of the presentation. Be sure to include details regarding use of significance testing at a particular level of significance, P-values, and confidence intervals.

  12. 6.71 Drug treatment to stop smoking. A company matches 200 smokers who signed up for the company’s drug treatment with 200 smokers from the general population. Matching was done on length of smoking, number of packs per day, age, and sex. The company then followed the smokers for six months and recorded whether they quit smoking. The company concludes its drug treatment increases the chance of quitting smoking by 50% (P=0.008). Explain why the significance of this study is suspect.

  13. 6.72 Predicting success of trainees. What distinguishes managerial trainees who eventually become executives from those who, after expensive training, don’t succeed and leave the company? We have abundant data on past trainees—data on their personalities and goals, their college preparation and performance, and even their family backgrounds and hobbies. Statistical software makes it easy to perform dozens of significance tests on these dozens of variables to see which ones best predict later success. We find that future executives are significantly more likely than washouts to have an urban or suburban upbringing and an undergraduate degree in a technical field.

    Explain clearly why using these “significant” variables to select future trainees is not wise. Then suggest a follow-up study using this year’s trainees as subjects that should clarify the importance of the variables identified by the first study.

  14. 6.73 Searching for significance. A research team is looking for risk factors associated with Alzheimer’s disease. The team has decided to investigate roughly 500 different factors, testing each at the α=0.05 level. Explain why these test results would lead to misleading conclusions.

  15. 6.74 More on searching for significance. You perform 1000 significance tests using α=0.05. Assuming that all null hypotheses are true, about how many of the test results would you expect to be statistically significant? Explain how you obtained your answer.

  16. 6.75 Interpreting a very small P-value. Assume that you are performing a large number of significance tests. Let n be the number of these tests. How large would n need to be for you to expect about one P-value to be 0.00001 or smaller? Use this information to write an explanation of how to interpret a result that has P=0.00001 in this setting.

  17. 6.76 An adjustment for multiple tests. One way to deal with the problem of misleading P-values when performing more than one significance test is to adjust the criterion you use for statistical significance. The Bonferroni method does this in a simple way. If you perform two tests and want to use the α=5% significance level, you require a P-value of 0.05/2=0.025 to declare either one of the tests significant. In general, if you perform k tests and want protection at level α, use α/k as your cutoff for statistical significance. You perform six tests and obtain individual P-values of 0.075, 0.021, 0.285, 0.002, 0.015, and <0.001. Which of these are statistically significant using the Bonferroni procedure with α=0.05? (A short code sketch of this adjustment follows the exercise list.)

  18. 6.77 Significance using the Bonferroni procedure. Refer to the previous exercise. A researcher has performed 12 tests of significance and wants to apply the Bonferroni procedure with α=0.05. The calculated P-values are 0.141, 0.519, 0.186, 0.753, 0.001, 0.008, 0.646, 0.038, 0.898, 0.013, <0.002, and 0.538. Which of these tests reject their null hypotheses with this procedure?
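
The Bonferroni adjustment referenced in Exercises 6.76 and 6.77 is a one-line computation: with k tests at overall level α, compare each P-value with α/k. Here is a minimal sketch using hypothetical P-values, not those from the exercises.

    # Bonferroni adjustment: with k tests at overall level alpha,
    # declare a test significant only if its P-value is below alpha / k.
    def bonferroni_significant(p_values, alpha=0.05):
        cutoff = alpha / len(p_values)
        return [p for p in p_values if p < cutoff]

    # Hypothetical example: five tests at alpha = 0.05, so the cutoff is 0.01.
    print(bonferroni_significant([0.003, 0.020, 0.400, 0.009, 0.060]))  # [0.003, 0.009]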