HYPOTHESIS TESTING
A very common use of statistical techniques is to test a question,
called the hypothesis. Also often termed significance
testing, this approach yields the valuable concept of the "P
value." It is this P value that is often used to
summarize the statistical strength of the data analysis in supporting or rejecting
the hypothesis. When we compare two groups, we often describe the test as a comparison
of the null hypothesis, which states that there is no difference between the two
groups of data, and the alternative hypothesis, which states that there is a difference.
If instead we were analyzing a single group of data and comparing two descriptions
of that group, such as determining whether the mean value calculated from the data
is different from 0, the two hypotheses would be "mean = 0" versus "mean ≠ 0."
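As an illustration of the single-group case, the following sketch in Python (using NumPy and SciPy; the simulated measurements and the choice of a one-sample t test are assumptions for illustration, not part of the discussion above) computes a two-sided P value for "mean = 0" versus "mean ≠ 0":

import numpy as np
from scipy import stats

# Hypothetical measurements; in practice these would be the observed data.
rng = np.random.default_rng(0)
data = rng.normal(loc=0.4, scale=1.0, size=25)

# Null hypothesis: mean = 0; alternative: the mean differs from 0 (two-sided).
result = stats.ttest_1samp(data, popmean=0.0)
print(f"t = {result.statistic:.3f}, P = {result.pvalue:.4f}")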
The P value is computed from the observed data as the probability of obtaining data
this extreme (or more extreme) if in reality the null hypothesis were true. It is
important to note that this calculation assumes the null hypothesis; P is a statement
about the data under that assumption, not the probability that either the alternative
or the null hypothesis is true. A small P value therefore means that, if the null
hypothesis really were true, data like ours would be unlikely to arise. In actual
experimental situations, all data have some randomness, so we can never be certain
that our observation is not just chance and bad luck.
In hypothesis testing we choose a level of
significance, termed the alpha value,
and declare that if P is less than alpha, the result
is statistically significant and the alternative hypothesis is accepted. If P is
not less than alpha, the result is not significant and we do not reject the null
hypothesis (in practice this is often described loosely as accepting the null hypothesis).
Remember the logical sequence: we choose alpha in advance, then compute P from the
data. If P is less than alpha, we agree to accept the alternative hypothesis and
reject the null hypothesis. The P value is the chance that, if the null hypothesis
were true, the data would nevertheless come out as observed simply through bad luck.
Typically, we set alpha equal to .05 and want P to be less than that. Such a P value
means that the chance of the data coming out this way, if the null hypothesis were
in fact true, was less than 5% (P = .05).
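One way to see what alpha = .05 means in practice is a small simulation, sketched below (the two-group t test and the simulated normal data are illustrative assumptions): when the null hypothesis really is true, P falls below .05 in roughly 5% of repeated experiments, which is exactly the "bad luck" rate that the choice of alpha builds in.

import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
alpha = 0.05
n_experiments = 10_000

false_positives = 0
for _ in range(n_experiments):
    # Both groups are drawn from the same distribution, so the null hypothesis is true.
    a = rng.normal(0.0, 1.0, size=30)
    b = rng.normal(0.0, 1.0, size=30)
    if stats.ttest_ind(a, b).pvalue < alpha:
        false_positives += 1

# Expect a fraction close to 0.05.
print(f"Fraction of experiments with P < {alpha}: {false_positives / n_experiments:.3f}")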
Choosing the level of significance is an arbitrary decision.
We might want to be very sure that we did not erroneously reject the null hypothesis
and set alpha at .01. It is common in biomedical studies to pick an alpha value
of .05. Also be aware that a very small P value
may lead us to reject the null hypothesis but may not give us much information about
the exact nature of the alternative. As an example, tossing a coin and getting six
heads in a row might lead us to reject the null hypothesis that the coin is a fair
coin, but it is not enough information to convince us that the coin always comes
up heads. We calculate the P value from the data
assuming the null hypothesis and use a small value of P
to allow us to reject the null hypothesis, yet that does not fully characterize the
alternative hypothesis.
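For the coin example, the arithmetic can be checked directly. The short sketch below (Python with SciPy; framing the question as an exact two-sided binomial test is one reasonable choice, not the only one) computes the chance of six heads in six tosses of a fair coin and the corresponding P value; both fall below .05, so we would reject the null hypothesis of a fair coin at that level, yet this by itself tells us nothing about whether the coin always comes up heads.

from scipy import stats

# Probability of six heads in six tosses if the coin is fair.
p_six_heads = 0.5 ** 6
print(f"P(six heads | fair coin) = {p_six_heads:.4f}")  # about 0.016

# Exact two-sided binomial test of "the coin is fair" given 6 heads in 6 tosses.
result = stats.binomtest(k=6, n=6, p=0.5, alternative="two-sided")
print(f"Two-sided P value = {result.pvalue:.4f}")  # about 0.031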
Power
It should seem intuitively obvious that the ability of a statistical
test to make a valid determination depends on the amount of data available. With
only a few observations or data points, we cannot be as certain about a conclusion
as if we had large amounts of data. If our goal were determining the fairness of
a coin, tossing it twice would not convince us of much; tossing it a thousand times,
however, would give us a good idea about its nature. In statistical hypothesis testing,
the power of a study is a description of the ability
of the study to detect a true difference. A statistical hypothesis test may fail
to achieve significance (fail to get P < alpha) either because the data truly do
not reflect a difference or because there are so few data points that the test cannot
produce a small P value even when a difference exists.
We define a beta error as the probability of falsely
accepting the null hypothesis (calculated by assuming that the alternative hypothesis
is true), and the power = 1 −
beta. For the power to be good, that is, for the beta value to be small, we want
our study to have enough data points that a true difference, if one exists, will
be detected reliably.
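Power can also be estimated by simulation, as in the hedged sketch below (the group size of 30, the assumed true difference of 0.5, and the use of a two-sample t test are illustrative choices): power is estimated as the fraction of simulated studies that reach P < alpha when a real difference exists, and beta is the remaining fraction.

import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
alpha = 0.05
n_per_group = 30        # assumed sample size per group
true_difference = 0.5   # assumed real difference between the group means
n_simulations = 5_000

significant = 0
for _ in range(n_simulations):
    control = rng.normal(0.0, 1.0, size=n_per_group)
    treated = rng.normal(true_difference, 1.0, size=n_per_group)
    if stats.ttest_ind(control, treated).pvalue < alpha:
        significant += 1

power = significant / n_simulations
beta = 1 - power
print(f"Estimated power = {power:.2f}, beta = {beta:.2f}")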
A sample size calculation is often performed before an experiment
is undertaken to determine how much data would be necessary to have a reasonable
chance of detecting the effect that is expected to be present. Standard computer
programs are available that will compute the number of data points necessary when
the alpha and power values are provided and some estimate is made of the likely observed
effect. It is important to be cautious about any study that is interpreted as "negative,"
that is, reports no significant difference between the study groups. It may be that
no difference truly exists, but it may also be that the sample size was too small
to determine a difference with statistical significance.
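As one example of such a program, the statsmodels library in Python can solve for the required sample size once alpha, the desired power, and a guessed standardized effect size are supplied; the specific numbers below are illustrative assumptions only, sketching the calculation rather than prescribing values.

from statsmodels.stats.power import TTestIndPower

# Solve for the per-group sample size of a two-sample t test,
# given alpha, the desired power (1 - beta), and a guessed effect size.
analysis = TTestIndPower()
n_per_group = analysis.solve_power(effect_size=0.5,          # guessed standardized effect
                                   alpha=0.05,               # significance level
                                   power=0.80,               # desired power
                                   alternative="two-sided")
print(f"Required sample size per group: about {n_per_group:.0f}")  # roughly 64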