HYPOTHESIS TESTING
A very common use of statistical techniques is to test a question,
called the hypothesis. Also often termed significance
testing, this approach yields the valuable concept of the "P
value." It is this P value that is often used to
summarize the statistical strength of the data analysis in supporting or rejecting
the hypothesis. When we compare two groups, we often describe the test as a comparison
of the null hypothesis, which states that there is no difference between the two
groups of data, and the alternative hypothesis, which states that there is a difference.
If instead we were analyzing a single group of data and comparing two descriptions
of that group, such as determining whether the mean value calculated from the data
is different from 0, the two hypotheses would be "mean = 0" versus "mean ≠ 0."
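As an illustration of the single-group case, the following sketch in Python (using NumPy and SciPy; the simulated measurements and the choice of a one-sample t test are assumptions for illustration, not part of the discussion above) computes a two-sided P value for "mean = 0" versus "mean ≠ 0":

import numpy as np
from scipy import stats

# Hypothetical measurements; in practice these would be the observed data.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.4, scale=1.0, size=25)

# Null hypothesis: mean = 0; alternative: the mean differs from 0 (two-sided).
result = stats.ttest_1samp(data, popmean=0.0)
print(f"t = {result.statistic:.3f}, P = {result.pvalue:.4f}")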
The P value is computed from the observed data as the probability of obtaining data
this extreme (or more extreme) if in reality the null hypothesis were true. It is
important to note that this calculation assumes the null hypothesis; P is a statement
about the data under that assumption, not the probability that either the alternative
or the null hypothesis is true. A small P value therefore means that, if the null
hypothesis really were true, data like ours would be unlikely to arise. In actual
experimental situations, all data have some randomness, so we can never be certain
that our observation is not just chance and bad luck.
In hypothesis testing we choose a level of
significance, termed the alpha value,
and declare that if P is less than alpha, the result
is statistically significant and the alternative hypothesis is accepted. If P is
not less than alpha, the result is not significant and we do not reject the null
hypothesis (in practice this is often described loosely as accepting the null hypothesis).
Remember the logical sequence: we choose alpha in advance, then compute P from the
data. If P is less than alpha, we agree to accept the alternative hypothesis and
reject the null hypothesis. The P value is the chance that, if the null hypothesis
were true, the data would nevertheless come out as observed simply through bad luck.
Typically, we set alpha equal to .05 and want P to be less than that. Such a P value
means that the chance of the data coming out this way, if the null hypothesis were
in fact true, was less than 5% (P = .05).
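One way to see what alpha = .05 means in practice is a small simulation, sketched below (the two-group t test and the simulated normal data are illustrative assumptions): when the null hypothesis really is true, P falls below .05 in roughly 5% of repeated experiments, which is exactly the "bad luck" rate that the choice of alpha builds in.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 10_000

false_positives = 0
for _ in range(n_experiments):
    # Both groups are drawn from the same distribution, so the null hypothesis is true.
    a = rng.normal(0.0, 1.0, size=30)
    b = rng.normal(0.0, 1.0, size=30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

# Expect a fraction close to 0.05.
print(f"Fraction of experiments with P < {alpha}: {false_positives / n_experiments:.3f}")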
Choosing the level of significance is an arbitrary decision.
We might want to be very sure that we did not erroneously reject the null hypothesis
and set alpha at .01. It is common in biomedical studies to pick an alpha value
of .05. Also be aware that a very small P value
may lead us to reject the null hypothesis but may not give us much information about
the exact nature of the alternative. As an example, tossing a coin and getting six
heads in a row might lead us to reject the null hypothesis that the coin is a fair
coin, but it is not enough information to convince us that the coin always comes
up heads. We calculate the P value from the data
assuming the null hypothesis and use a small value of P
to allow us to reject the null hypothesis, yet that does not fully characterize the
alternative hypothesis.
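For the coin example, the arithmetic can be checked directly. The short sketch below (Python with SciPy; framing the question as an exact two-sided binomial test is one reasonable choice, not the only one) computes the chance of six heads in six tosses of a fair coin and the corresponding P value; both fall below .05, so we would reject the null hypothesis of a fair coin at that level, yet this by itself tells us nothing about whether the coin always comes up heads.

from scipy import stats

# Probability of six heads in six tosses if the coin is fair.
p_six_heads = 0.5 ** 6
print(f"P(six heads | fair coin) = {p_six_heads:.4f}")  # about 0.016

# Exact two-sided binomial test of "the coin is fair" given 6 heads in 6 tosses.
result = stats.binomtest(k=6, n=6, p=0.5, alternative="two-sided")
print(f"Two-sided P value = {result.pvalue:.4f}")  # about 0.031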
Power
It should seem intuitively obvious that the ability of a statistical
test to make a valid determination depends on the amount of data available. With
only a few observations or data points, we cannot be as certain about a conclusion
as if we had large amounts of data. If our goal were determining the fairness of
a coin, tossing it twice would not convince us of much; tossing it a thousand times,
however, would give us a good idea about its nature. In statistical hypothesis testing,
the power of a study is a description of the ability
of the study to detect a true difference. A statistical hypothesis test may fail
to achieve significance (fail to get P < alpha) either because the data truly do
not reflect a difference or because there are so few data points that the test cannot
produce a small P value even when a difference exists.
We define a beta error as the probability of falsely
accepting the null hypothesis (calculated by assuming that the alternative hypothesis
is true), and the power = 1 −
beta. For the power to be good, that is, for the beta value to be small, we want
our study to have enough data points that a true difference, if one exists, will
be detected reliably.
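Power can also be estimated by simulation, as in the hedged sketch below (the group size of 30, the assumed true difference of 0.5, and the use of a two-sample t test are illustrative choices): power is estimated as the fraction of simulated studies that reach P < alpha when a real difference exists, and beta is the remaining fraction.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05
n_per_group = 30        # assumed sample size per group
true_difference = 0.5   # assumed real difference between the group means
n_simulations = 5_000

significant = 0
for _ in range(n_simulations):
    control = rng.normal(0.0, 1.0, size=n_per_group)
    treated = rng.normal(true_difference, 1.0, size=n_per_group)
    if stats.ttest_ind(control, treated).pvalue < alpha:
        significant += 1

power = significant / n_simulations
beta = 1 - power
print(f"Estimated power = {power:.2f}, beta = {beta:.2f}")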
A sample size calculation is often performed before an experiment
is undertaken to determine how much data would be necessary to have a reasonable
chance of detecting the effect that is expected to be present. Standard computer
programs are available that will compute the number of data points necessary when
the alpha and power values are provided and some estimate is made of the likely observed
effect. It is important to be cautious about any study that is interpreted as "negative,"
that is, reports no significant difference between the study groups. It may be that
no difference truly exists, but it may also be that the sample size was too small
to determine a difference with statistical significance.
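As one example of such a program, the statsmodels library in Python can solve for the required sample size once alpha, the desired power, and a guessed standardized effect size are supplied; the specific numbers below are illustrative assumptions only, sketching the calculation rather than prescribing values.

from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size of a two-sample t test,
# given alpha, the desired power (1 - beta), and a guessed effect size.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,          # guessed standardized effect
                                   alpha=0.05,               # significance level
                                   power=0.80,               # desired power
                                   alternative="two-sided")
print(f"Required sample size per group: about {n_per_group:.0f}")  # roughly 64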