Carrying out a test of significance is often quite simple, especially if the P-value is given effortlessly by a computer. Using tests wisely is not so simple. Each test is valid only in certain circumstances, and properly produced data are particularly important.
The z test, for example, should bear the same warning label that was attached in Section 6.1 to the corresponding confidence interval (page 341). Similar warnings accompany the other tests that we will learn. Some additional caveats concern significance tests more than confidence intervals, enough to warrant this separate section. Some hesitation about the unthinking use of significance tests is a sign of statistical maturity.
        The reasoning of significance tests has appealed to researchers in
        many fields, and tests are widely used to report research results. In
        this setting
        
          The intention of a significance test is to give a clear statement of
          the degree of evidence provided by the sample against the null
          hypothesis. The P-value does this. It is common practice,
          however, to describe the results in terms of statistical significance
          (i.e., whether
          
          This is because
          there is no sharp border between “significant” and “not
            significant”; there is only increasingly strong evidence as the
            P-value decreases.
          Including the P-value with a description of the detected effect
          allows for a much clearer conclusion.
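To see why a sharp cutoff can mislead, the following minimal Python sketch computes two-sided P-values for two z statistics that sit just on either side of the 5% critical value. The values 1.95 and 1.97 are hypothetical, chosen only for illustration, and the function name is ours.

```python
from statistics import NormalDist


def two_sided_p_value(z: float) -> float:
    """Two-sided P-value for a test statistic z that is standard Normal under H0."""
    return 2 * (1 - NormalDist().cdf(abs(z)))


# Hypothetical test statistics just below and just above the 5% critical value.
for z in (1.95, 1.97):
    p = two_sided_p_value(z)
    verdict = "reject H0" if p < 0.05 else "fail to reject H0"
    print(f"z = {z:.2f}  P = {p:.4f}  at the 0.05 level: {verdict}")
```

The two P-values are about 0.051 and 0.049. The evidence against the null hypothesis is nearly identical, yet the verdicts at the 0.05 level are opposite, which is why reporting the P-value itself is more informative.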
        
            Suppose that the test statistic for a two-sided significance test
            for a population mean is
            
            We have failed to meet the standard of evidence for
            
Here is another example where the P-value provides useful information beyond that provided by the statement that we reject or fail to reject the null hypothesis.
            We have a test statistic of
            
          One reason for the common use of
          
          Recently, certain scientific communities, which have historically used
          
          These arguments suggest that the choice of
          
          When a null hypothesis (“no effect” or “no difference”) can be
          rejected at the usual level
          
When large samples are available, even tiny deviations from the
            null hypothesis will be statistically significant.
          In the era of big data, this is particularly important to keep in
          mind.
        
            Suppose that we are testing the null hypothesis of no correlation
            between two variables. With 400 observations, an observed
            correlation of only
            
            
                Figure 6.15 Scatterplot of
                
          For practical purposes, we might well decide to ignore this
          association.
          
          Statistical significance is not the same as practical
            significance.
          Statistical significance rarely tells us about the importance of the
          experimental results. This depends on subject-matter knowledge and the
          context of the experiment.
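A rough Python sketch of how sample size alone can drive significance for a correlation. The test statistic t = r√(n − 2)/√(1 − r²) grows with n even when r stays fixed. The value r = 0.1 is a hypothetical illustration, and the P-value uses a Normal approximation to the t distribution, which is an assumption that is reasonable only for large n.

```python
from math import sqrt
from statistics import NormalDist


def correlation_p_value(r: float, n: int) -> float:
    """Approximate two-sided P-value for H0: no correlation.

    Uses t = r * sqrt(n - 2) / sqrt(1 - r^2) and treats t as standard
    Normal, a large-sample approximation to the t(n - 2) distribution.
    """
    t = r * sqrt(n - 2) / sqrt(1 - r ** 2)
    return 2 * (1 - NormalDist().cdf(abs(t)))


r = 0.1  # hypothetical weak correlation, held fixed while n grows
for n in (100, 400, 1600):
    p = correlation_p_value(r, n)
    print(f"n = {n:4d}  r = {r}  r^2 = {r ** 2:.2f}  approximate P = {p:.5f}")
```

With r fixed at 0.1, the correlation explains only 1% of the variation (r² = 0.01), yet the approximate P-value falls below 0.05 once n reaches 400.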
        
          The remedy for attaching too much importance to statistical
          significance is to pay attention to the actual experimental results as
          well as to the P-value. Plot your data and examine them
          carefully. Beware of outliers.
          
The user of statistics who feeds the data to a computer without
            exploratory analysis will often be embarrassed.
          It is usually wise to give a confidence interval for the parameter in
          which you are interested. Confidence intervals are not used as often
          as they should be, while tests of significance are overused.
        
                6.26 Is it significant? More than 200,000
                people worldwide take the GMAT examination each year when they
                apply for MBA programs. Their scores vary Normally with mean
                
in each of the following situations.
                    The students’ average score is
                    
                    Now suppose that the average score is
                    
Explain how you would reconcile this difference in significance, especially if any increase greater than 15 points is considered a success.
There is a tendency to conclude that there is no effect whenever a P-value fails to attain the usual 5% standard. A provocative editorial in the British Medical Journal titled “Absence of Evidence Is Not Evidence of Absence” deals with this issue [24]. Here is one of the examples this editorial cites.
A randomized trial of interventions for reducing transmission of HIV-1 reported an incidence rate ratio of 1.00, meaning that the intervention group and the control group both had the same rate of HIV-1 infection. The 95% confidence interval was reported as 0.63 to 1.58 [25]. The editorial notes that a summary of these results that says the intervention has no effect on HIV-1 infection is misleading. The confidence interval indicates that the intervention may be capable of achieving a 37% decrease in infection; it might also be harmful and produce a 58% increase in infection. Clearly, more data are needed to distinguish between these possibilities.
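The percentages quoted above follow directly from the interval endpoints. The short Python sketch below reproduces that arithmetic. The geometric-midpoint check at the end assumes the interval was built symmetrically on the log scale, which is common for ratios but is not stated in the source.

```python
from math import sqrt

# Values reported in the example: rate ratio and its 95% confidence interval.
rate_ratio = 1.00
ci_lower, ci_upper = 0.63, 1.58

# A ratio below 1 is a decrease and a ratio above 1 is an increase,
# both measured relative to the control group.
max_decrease_pct = (1 - ci_lower) * 100
max_increase_pct = (ci_upper - 1) * 100

# Assumption: if the interval is symmetric on the log scale, the geometric
# mean of its endpoints should sit close to the point estimate.
geometric_midpoint = sqrt(ci_lower * ci_upper)

print(f"Largest plausible decrease: {max_decrease_pct:.0f}%")
print(f"Largest plausible increase: {max_increase_pct:.0f}%")
print(f"Geometric midpoint of the interval: {geometric_midpoint:.3f}")
```

An estimate of "no difference" is therefore compatible with anything from a substantial benefit to a substantial harm.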
The situation can be worse. Research in some fields has rarely been published unless significance at the 0.05 level is attained.
A survey of four journals published by the American Psychological Association showed that of 294 articles using statistical tests, only eight reported results that did not attain the 5% significance level [26]. It is very unlikely that these were the only eight studies of scientific merit that did not attain significance at the 0.05 level. Manuscripts describing other studies were likely rejected because of a lack of statistical significance or never submitted in the first place due to the expectation of rejection.
In some areas of research, small effects that are detectable only with large sample sizes can be of great practical significance. Data accumulated from a large number of patients taking a new drug may be needed before we can conclude that there are life-threatening consequences for a small number of people.
On the other hand, sometimes a meaningful result is not found to be significant.
            A sample of size 10 gave a correlation of
            
          
          Another important aspect of planning a study is to verify that the
            test you plan to use has a high probability of detecting an
            effect of the size you hope to find.
          This probability is the power of the test. Power calculations
          are discussed in Section 7.3 and elsewhere for particular data
          analysis procedures.
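As a minimal sketch of what such a calculation looks like, the Python code below computes the power of a one-sided z test of H0: μ = μ0 against Ha: μ > μ0 when σ is known. The effect size, σ, and sample sizes are hypothetical, and the function name is ours. This is an illustration under those assumptions, not the procedure of any particular later section.

```python
from math import sqrt
from statistics import NormalDist


def z_test_power(effect: float, sigma: float, n: int, alpha: float = 0.05) -> float:
    """Power of the one-sided z test when the true mean is mu0 + effect.

    The test rejects H0 when z >= z*, the upper-alpha critical value.
    Under the alternative, z is Normal with mean effect * sqrt(n) / sigma
    and standard deviation 1.
    """
    z_star = NormalDist().inv_cdf(1 - alpha)
    shift = effect * sqrt(n) / sigma
    return 1 - NormalDist().cdf(z_star - shift)


# Hypothetical planning values: detect a 5-unit increase when sigma = 20.
for n in (10, 50, 200):
    power = z_test_power(effect=5, sigma=20, n=n)
    print(f"n = {n:3d}  power = {power:.3f}")
```

Under these assumed values, a sample of 10 detects the effect only about one time in five, so a nonsignificant result from such a study says little about whether the effect exists.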
        
          In Chapter 3, we learned that badly designed surveys or
          experiments often produce invalid results.
          
Formal statistical inference cannot correct basic flaws in the
            design.
        
There is no doubt that there is a significant difference in English vocabulary scores between high school seniors who have studied a foreign language and those who have not. But because the effect of actually studying a language is confounded with the differences between students who choose language study and those who do not, this statistical significance is hard to interpret. The most plausible explanation is that students who were already good at English chose to study another language. A randomized comparative experiment would isolate the actual effect of language study and so make significance meaningful. Do you think it is ethical to do such a study?
          Tests of significance and confidence intervals are based on the laws
          of probability. Randomization in sampling or experimentation ensures
          that these laws apply. But we must often analyze data that do not
          arise from randomized samples or experiments.
          
To apply statistical inference to such data, we must have
            confidence in a probability model for the data.
          The diameters of successive holes bored in auto engine blocks during
          production, for example, may behave like independent observations from
          a Normal distribution. We can check this probability model by
          examining the data. If the Normal distribution model appears
          approximately correct, we can apply the methods of this chapter to do
          inference about the process mean diameter
          
6.27 Home security systems. A recent TV advertisement for home security systems said that homes without an alarm system are three times more likely to be broken into. Suppose that this conclusion was obtained by examining an SRS of police records of break-ins and determining whether the percent of homes with alarm systems was significantly smaller than 50%. Explain why the significance of this study is suspect and propose an alternative study that would help clarify the importance of an alarm system.
          Statistical significance is an outcome much desired by researchers.
          It means (or ought to mean) that you have found an effect that you
          were looking for.
          
The reasoning behind statistical significance works well if you
            decide what effect you are seeking, design an experiment or sample
            to search for it, and use a test of significance to weigh the
            evidence you get.
          But because a successful search for a new scientific phenomenon often
          ends with statistical significance, it is all too tempting to make
          significance itself the object of the search. There are several ways
          to do this, none of them acceptable in polite scientific society.
        
In genomics experiments, it is common to assess the differences in expression for tens of thousands of genes. If each of these genes were examined separately and statistical significance declared for all that had P-values that pass the 0.05 standard, we would have quite a mess. In the absence of any real biological effects, we would expect that, by chance alone, approximately 5% of these tests would show statistical significance. Much research in genomics is directed toward appropriate ways to deal with this situation [27].
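The size of the problem is easy to quantify. The Python sketch below computes the expected number of chance "discoveries" when every null hypothesis is true, and the probability of at least one. The count of 20,000 genes is a hypothetical stand-in for "tens of thousands," and the second function assumes the tests are independent, which real gene-expression data need not satisfy.

```python
def expected_false_positives(num_tests: int, alpha: float = 0.05) -> float:
    """Expected number of significant results when all null hypotheses are true."""
    return num_tests * alpha


def prob_at_least_one(num_tests: int, alpha: float = 0.05) -> float:
    """Chance of at least one significant result when all nulls are true.

    Assumes the tests are independent of one another.
    """
    return 1 - (1 - alpha) ** num_tests


num_genes = 20_000  # hypothetical number of genes tested
print(f"Expected chance findings among {num_genes} tests: "
      f"{expected_false_positives(num_genes):.0f}")

for m in (1, 20, 100):
    print(f"{m:3d} tests: P(at least one significant) = {prob_at_least_one(m):.3f}")
```

Even 20 independent tests at the 0.05 level give better than an even chance of at least one significant result when no real effects exist.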
          We do not mean that searching data for suggestive patterns is not
          proper scientific work. It certainly is. Many important discoveries
          have been made by accident rather than by design. Exploratory analysis
          of data is an essential part of statistics. We do mean that the usual
          reasoning of statistical inference does not apply when the search for
          a pattern is successful.
          
You cannot legitimately test a hypothesis on the same data that
            first suggested that hypothesis.
          The remedy is clear. Once you have a hypothesis, design a study to
          search specifically for the effect you now think is there. If the
          result of this study is statistically significant, you have real
          evidence.
        
                P-values are more informative than the reject-or-not
                result of a level
                
Very small effects can be highly significant (small P), especially when a test is based on a large sample. A statistically significant effect need not have practical significance. Always plot the data to display the effect you are seeking, and use confidence intervals to estimate the actual values of parameters.
                Lack of significance does not imply that the null hypothesis is true.
Significance tests are not always valid. Faulty data collection, outliers in the data, and testing a hypothesis on the same data that suggested the hypothesis can invalidate a test.
Many tests run at once will probably produce some significant results by chance alone, even if all the null hypotheses are true.
                6.60 What other information is needed? An
                observational study that involved
                
6.61 What do you know? A research report described two results that both achieved statistical significance at the 5% level. The P-value for the first is 0.048; for the second it is 0.0002. Do the P-values add any useful information beyond that conveyed by the statement that both results are statistically significant? Write a short paragraph explaining your views on this question.
6.62 Selective publication based on results. In addition to statistical significance, selective publication can also be due to the observed outcome. A review of 74 studies of antidepressant agents found 38 studies with positive results and 36 studies with negative or questionable results. All but 1 of the 38 positive studies were published. Of the remaining 36, 22 were not published, and 11 were published in such a way as to convey a positive outcome [28]. Describe how such selective reporting can have adverse consequences on health care.
6.63 What a test of significance can answer. Explain whether a test of significance can answer each of the following questions.
Is the sample or experiment properly designed?
Is the observed effect compatible with the null hypothesis?
Is the observed effect important?
                6.64 Vitamin C and colds. In a study to
                investigate whether vitamin C prevents colds, 400 subjects are
                assigned at random to one of two groups. The experimental group
                takes a vitamin C tablet daily, while the control group takes a
                placebo. At the end of the experiment, the researchers calculate
                the difference between the percents of subjects in the two
                groups who were free of colds. This difference is statistically
                significant
                
6.65 How far do rich parents take us? How much education children get is strongly associated with the wealth and social status of their parents, termed “socioeconomic status,” or SES. The SES of parents, however, has little influence on whether children who have graduated from college continue their education. One study looked at whether college graduates took the graduate admissions tests for business, law, and other graduate programs. The effects of the parents’ SES on taking the LSAT test for law school were “both statistically insignificant and small.”
What does “statistically insignificant” mean?
Why is it important that the effects were small in size as well as statistically insignificant?
6.66 Do you agree? State whether or not you agree with each of the following statements and provide a short summary of the reasons for your answers.
If the P-value is larger than 0.05, the null hypothesis is true.
Practical significance is not the same as statistical significance.
We can perform a statistical analysis using any set of data.
If you find an interesting pattern in a set of data, it is appropriate to then use a significance test to determine its significance.
                    It’s always better to use a significance level of
                    
6.67 Practical significance and sample size. Every user of statistics should understand the distinction between statistical significance and practical importance. A sufficiently large sample will declare very small effects statistically significant. Consider the study of elite female Canadian athletes in Exercise 6.44 (page 366). Female athletes were consuming an average of 2403.7 kcal/d with a standard deviation of 880 kcal/d. Suppose that a nutritionist is brought in to implement a new health program for these athletes. This program should increase mean caloric intake but not change the standard deviation. Given the standard deviation and how calorie deficient these athletes are, a change in the mean of 50 kcal/d to 2453.7 is of little importance. However, with a large enough sample, this change can be significant. To see this, calculate the P-value for the test of
in each of the following situations:
                    A sample of 100 athletes; their average caloric intake is
                    
                    A sample of 500 athletes; their average caloric intake is
                    
                    A sample of 2500 athletes; their average caloric intake is
                    
6.68 Statistical versus practical significance. A study with 7500 subjects reported a result that was statistically significant at the 5% level. Explain why this result might not be particularly important.
6.69 More on statistical versus practical significance. A study with 14 subjects reported a result that failed to achieve statistical significance at the 5% level. The P-value was 0.051. Write a short summary of how you would interpret these findings.
                
                6.70 Find journal articles. Find two journal
                articles that report results with statistical analyses. For each
                article, summarize how the results are reported and write a
                critique of the presentation. Be sure to include details
                regarding use of significance testing at a particular level of
                significance, P-values, and confidence intervals.
              
                6.71 Drug treatment to stop smoking.
                A company matches 200 smokers who signed up for the company’s
                drug treatment with 200 smokers from the general population.
                Matching was done on length of smoking, number of packs per day,
                age, and sex. The company then followed the smokers for six
                months and recorded whether they quit smoking. The company
                concludes its drug treatment increases the chance of quitting
                smoking by 50%
                
                
                6.72 Predicting success of trainees. What
                distinguishes managerial trainees who eventually become
                executives from those who, after expensive training, don’t
                succeed and leave the company? We have abundant data on past
                trainees—data on their personalities and goals, their college
                preparation and performance, and even their family backgrounds
                and hobbies. Statistical software makes it easy to perform
                dozens of significance tests on these dozens of variables to see
                which ones best predict later success. We find that future
                executives are significantly more likely than washouts to have
                an urban or suburban upbringing and an undergraduate degree in a
                technical field.
              
Explain clearly why using these “significant” variables to select future trainees is not wise. Then suggest a follow-up study using this year’s trainees as subjects that should clarify the importance of the variables identified by the first study.
                6.73 Searching for significance.
                A research team is looking for risk factors associated with
                Alzheimer’s disease. The team has decided to investigate roughly
                500 different factors, testing each at the
                
                6.74 More on searching for significance. You
                perform 1000 significance tests using
                
                6.75 Interpreting a very small P-value.
                Assume that you are performing a large number of significance
                tests. Let n be the number of these tests. How large
                would n need to be for you to expect about one
                P-value to be 0.00001 or smaller? Use this information to
                write an explanation of how to interpret a result that has
                
                
                6.76 An adjustment for multiple tests. One way
                to deal with the problem of misleading P-values when
                performing more than one significance test is to adjust the
                criterion you use for statistical significance. The
                Bonferroni method does this in a simple way. If
                you perform two tests and want to use the
                
                
                6.77 Significance using the Bonferroni procedure.
                Refer to the previous exercise. A researcher has performed 12
                tests of significance and wants to apply the Bonferroni
                procedure with