There are many different statistical significance, or hypothesis, tests. They all follow the same basic principle. The appropriate test for a given situation depends on the nature of the data being analysed.
This chapter explains p-values, gives a detailed description of significance testing, and discusses the relationship between confidence intervals and significance tests.
- Null hypothesis
All statistical significance tests start with a null hypothesis. A statistical significance test measures the strength of evidence that the data sample supplies for or against some proposition of interest.
This proposition is known as a 'null hypothesis', since it usually relates to there being 'no difference' between groups or 'no effect' of a treatment. For example:
- CMV infected babies have the same average birthweight as non-infected babies;
- ß-thalassaemia does not have any effect on ferritin level;
- Recombinant human erythropoietin has no effect on the haemoglobin levels in premature infants.
Even if our hypothesis is not to do with 'no difference' it is still convention that the hypothesis being tested is known as the null hypothesis.
Having set up the null hypothesis, we calculate the probability of obtaining the observed data sample if the null hypothesis were true. This probability is known as the p-value.
As it is a probability, the p-value lies between 0 and 1. Smaller p-values suggest that the null hypothesis is less likely to be true.
The null hypothesis is never totally disproved but may be shown to be highly unlikely.
For example, a p-value of 0.001 means that a sample as extreme as that observed would occur 1 time in 1000 if the null hypothesis were true. We may believe that this chance is small enough to discount and act (clinically/practically) as though the null hypothesis were false.
NOTE: If the p-value is small, then we say that the data is unlikely to have occurred if the null hypothesis were true. We have NOT disproved the null hypothesis; the sample is unlikely BUT NOT IMPOSSIBLE.
If the p-value is not small, then we say that the data is consistent with the null hypothesis. We have NOT proved the null hypothesis. The sample is consistent with the hypothesised population, however IT MAY ALSO BE CONSISTENT WITH OTHER POPULATIONS, which may be very different and lead to widely differing conclusions for the study.
There are many different types of statistical significance test. The test appropriate for a given situation will depend on the type of outcome variable being studied (categorical or numeric) and the number of variables that are being considered. All statistical significance tests yield a p-value, which quantifies how likely the observed data would be if some null hypothesis were true.
In the previous chapter we learnt about standard errors, how these are calculated, and how they attach a level of precision to a sample estimate. Standard errors are also used in the calculation of p-values. P-values are obtained from the table of the normal distribution according to the number of standard errors the observed sample statistic is from the hypothesized value.
Parametric significance tests all follow this process. The names of the significance tests vary according to the outcome measurement.
In order to further clarify the usage and interpretation of p-values, the next section describes a simple statistical significance test (the one-sample t-test) in detail. This is done in an attempt to help understand why the process is reasonable, why it works, and what the limitations may be. Following on from the full description of the one-sample t-test, there are examples of other significance tests in action.
- One Sample t-test
This test is appropriate when a continuous variable is recorded on a single sample of individuals or items and the mean value is to be tested against some hypothesised population mean.
For example: We may know the average birthweight expected amongst normal babies and wish to establish whether this differs consistently amongst babies born with a particular problem known as congenital cytomegalovirus (subsequently abbreviated to CMV).
We could use the average birthweight for normal babies given in the previous chapter (3263.57g). We know that if the average birthweight of CMV babies is 3263.57g (the hypothesised mean), then the means of samples of those babies will be normally distributed around mean 3263.57g. The extent to which the sample mean would be expected to differ from the population value can be quantified using the standard error. We would expect 95% of the samples to have values within ± 1.96 standard errors of the population mean value. We could draw the distribution on which the observed sample mean will lie if the null hypothesis is true.
If the observed sample mean lies towards the centre of this distribution then the observed sample is compatible with the null hypothesis:
If it lies in the tails of the distribution then the null hypothesis appears unlikely to be true:
The number of standard errors by which the observed mean lies away from the hypothesised mean gives a measure of how likely it is that the observed sample actually came from the hypothesised distribution. A p-value can be obtained from the table of the normal distribution seen previously and shown below. Just read 'number of standard errors' rather than standard deviations in the first column:
An observed mean 2.58 standard errors (or more) away from the hypothesised mean will occur with probability 0.01 (p=0.01); if the observed mean is 0.84 standard errors away then p=0.4 etc.
It is unlikely that the number of standard errors will be exactly equal to one of the values given in the reduced table above (such as 2.58 or 0.84).
For example, we can see from the table that a value 2.1 standard errors away or more extreme, which is between 1.96 and 2.33, will occur between 0.02 and 0.05 of the time by chance. Using the spreadsheet we can see that the exact proportion is 0.036, i.e. if the sample came from a population with that mean then a sample mean as extreme (or as different from the hypothesised population mean) would occur by chance 0.036 of the time. Alternatively, we would expect between 3 and 4 samples in every hundred drawn to have a mean that far away from the hypothesised population mean.
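The spreadsheet lookup can be imitated in a few lines of code. The following is a minimal sketch, assuming the standard normal distribution is the reference distribution (as in the table); the function name `two_sided_p` is a hypothetical helper and not part of these notes.

```python
import math

def two_sided_p(n_se):
    """Two-sided p-value for a statistic lying n_se standard errors from
    the hypothesised value, assuming a standard normal distribution."""
    # erfc(|z| / sqrt(2)) equals 2 * (1 - Phi(|z|)) for the standard normal.
    return math.erfc(abs(n_se) / math.sqrt(2))

p_exact = two_sided_p(2.1)
print(f"{p_exact:.3f}")  # 0.036
```

The same function gives values close to the tabulated 0.05 and 0.01 when called with 1.96 and 2.58.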
The birthweights of 53 CMV infected babies were recorded and they had an average birthweight of 3060.75g, standard deviation 601.03g. Using the sample value of 601.03g to estimate the population standard deviation, an estimate of the standard error is obtained by dividing this by the square root of the sample size (53). This gives an estimated standard error of 82.57g.
The observed mean 3060.75g is a distance 3263.57-3060.75 = 202.82g from the hypothesised mean; this is 202.82 /82.57= 2.45 standard errors.
From the normal distribution table, we obtain 0.01 < p < 0.02, or simply p < 0.02. From the spreadsheet we get the more exact p-value of 0.014.
So, the mean of the sample of 53 babies was lower than would plausibly have occurred by chance if the null hypothesis were true, and we conclude that CMV babies do have a lower average birthweight than non-infected babies.
Note that our conclusion COULD BE WRONG. It is possible that we have just happened, by chance, to have randomly selected one of the 1.4 samples in every 100 of that size that would have a mean so different from the non-infected baby average.
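The CMV calculation can be retraced step by step in code. This is a sketch only, assuming (as the text does) that the p-value is read from the normal distribution; the variable names are illustrative. Because the code does not round intermediate results, the standard error and the number of standard errors may differ from the figures above in the last digit.

```python
import math

n = 53
sample_mean = 3060.75        # g, CMV infected babies
sample_sd = 601.03           # g
hypothesised_mean = 3263.57  # g, non-infected babies

# Standard error of the mean, estimated from the sample standard deviation.
se = sample_sd / math.sqrt(n)

# Distance of the observed mean from the hypothesised mean, in standard errors.
n_se = (sample_mean - hypothesised_mean) / se

# Two-sided p-value from the standard normal distribution.
p_value = math.erfc(abs(n_se) / math.sqrt(2))

print(round(se, 2), round(n_se, 2), round(p_value, 3))
```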
- Level of Significance
If a p-value is low we say that the observed sample value is significantly different from the hypothesised population value. The lower the p-value, the more significant it is said to be. If the p-value is very low, we say the result is highly significant.
It is fairly common practice (but becoming less so) for p-values of less than 0.05 to be called 'significant', whilst those > 0.05 are said to be 'non-significant'. This is NOT GOOD PRACTICE.
P-values are probabilities and there is no sudden changeover from values being 'likely' to being 'unlikely' at some point of the distribution. Values become 'less likely' the further away from the mean and the more into the tails of the distribution we get.
When reporting p-values it is always best to give the actual value.
Sometimes results are expressed as 'significant at x%'.
'Significant at 5%' means that p < 0.05; 'significant at 1%' means that a p-value of less than 0.01 was obtained from the normal table, and so on.
- One- or Two-Sided Test?
The normal distribution table gives both 1- and 2-sided p-values (columns 3 and 4 shown above). The 2-sided p-values are twice as large as the 1-sided. Both are shown for your information, although one can easily be calculated from the other.
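The relationship between the two columns can be checked directly. The sketch below assumes a standard normal reference distribution; `one_sided_p` and `two_sided_p` are hypothetical helper names, and 2.45 is the number of standard errors from the CMV example.

```python
import math

def one_sided_p(n_se):
    """Area in one tail of the standard normal beyond |n_se|."""
    return 0.5 * math.erfc(abs(n_se) / math.sqrt(2))

def two_sided_p(n_se):
    """Area in both tails: exactly twice the one-sided value."""
    return 2 * one_sided_p(n_se)

n_se = 2.45
print(round(one_sided_p(n_se), 4), round(two_sided_p(n_se), 4))
```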
A one-sided test is only appropriate if a large difference in one direction is impossible or would lead to the same action as no difference at all. This is not a common situation in medical research.
Expectation of a difference in one direction is not adequate justification. For example, we may expect a new treatment to be better than standard but if it is actually harmful (i.e. leads to a difference in the opposite direction to that expected) then we would not want to ignore this.
TWO-SIDED TESTS SHOULD BE USED unless there is a very good reason for doing otherwise. If one-sided tests are to be used the direction of the test must be specified in advance. One-sided tests should never be used simply as a device to increase the significance of a result.
- The Relationship Between Confidence Intervals and p-values
A confidence interval is a range of population values with which the sample data are compatible. A significance test considers the likelihood that the sample data has come from a particular hypothesised population.
The 95% confidence interval consists of all values less than 1.96 standard errors away from the sample value. Testing against any population value in this interval will lead to p > 0.05. Testing against values outside the 95% confidence interval (which are more than 1.96 standard errors away) will lead to p-values < 0.05.
Similarly, the 99% confidence interval consists of all values less than 2.58 standard errors away from the sample value. Testing against any hypothesised population value in this interval will give a p-value > 0.01. Testing against values outside the 99% confidence interval (which are more than 2.58 standard errors away) will lead to p-values < 0.01. In general:
1) The mean birthweight of 53 CMV infected babies was 3060.75g (standard deviation = 601.03g, standard error = 82.57g).
A 95% confidence interval for the population mean birthweight of CMV infected babies is therefore given by:
(3060.75 ± 1.96(82.57)) = (2898.91, 3222.59g)
Similarly, the 99% confidence interval for the mean is:
(3060.75 ± 2.58(82.57)) = (2847.72, 3273.78g)
We are 95% confident that the true mean is somewhere between 2898.91 and 3222.59g. Testing against values outside this range will lead to p-values < 0.05.
We are 99% confident that the true mean is between 2847.72 and 3273.78g (notice that this is a wider interval; we are more confident that it contains the population mean). Testing against values within this range will lead to p-values > 0.01.
The test given previously showed that the sample mean was significantly different from a hypothesised population mean of 3263.57g. The p-value for that test was 0.014 and this corresponds to the hypothesised population mean of 3263.57g lying outside the 95% confidence interval but inside the 99%.
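The correspondence between the two intervals and the p-value of 0.014 can be verified in code. This is an illustrative sketch using the rounded standard error quoted in the text; `confidence_interval` is a hypothetical helper name.

```python
def confidence_interval(estimate, se, multiplier):
    """Interval of values within `multiplier` standard errors of the estimate."""
    half_width = multiplier * se
    return estimate - half_width, estimate + half_width

mean, se = 3060.75, 82.57      # g, from the CMV sample
hypothesised_mean = 3263.57    # g

ci95 = confidence_interval(mean, se, 1.96)
ci99 = confidence_interval(mean, se, 2.58)

# Outside the 95% interval means p < 0.05; inside the 99% means p > 0.01.
inside95 = ci95[0] <= hypothesised_mean <= ci95[1]
inside99 = ci99[0] <= hypothesised_mean <= ci99[1]
print(inside95, inside99)  # False True
```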
2) A sample of 33 boys with recurrent infections have their diastolic blood pressures measured. Their mean blood pressure is 62.5 mmHg, standard deviation 8.2.
Using the sample standard deviation to estimate the population standard deviation, the means of samples of size 33 will be distributed with standard error 8.2/√33 = 1.43 mmHg.
Therefore, a 99% confidence interval for the mean diastolic blood pressure of boys with recurrent infections is (62.5 ± 2.58(1.43)) = (58.81, 66.18mmHg).
We wish to know whether boys with recurrent infections are different from boys in general who are known to have pressures of on average 58.5mmHg. The null hypothesis to be tested is that the 33 boys come from a population with a mean dbp of 58.5mmHg.
The observed sample mean is (62.5 - 58.5)/1.43 = 2.797 standard errors away from the hypothesised mean of 58.5mmHg.
Consulting the table of the normal distribution, we find 0.002 < p < 0.01. Using the linked spreadsheet we get the exact p-value of 0.005: a sample with a mean 4mmHg away from the hypothesised value would occur by chance one time in 200 (5 in 1000).
The 99% confidence interval does not contain the hypothesised mean and p < 0.01 as expected.
- Other Parametric Significance Tests
Any of the estimates for which a standard error can be calculated (pages 112-114) can be tested against some null hypothesized value in a similar way:
- Calculate the difference between the hypothesized value and the sample estimate
- Express this difference as a number of standard errors
- Use the Normal Distribution table or spreadsheet to obtain a p-value
These are known as the parametric significance tests. They make assumptions about the distribution of the sample parameter estimate.
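The three steps above can be sketched as one small function. This is an illustration under the assumption that the estimate is approximately normally distributed; the name `parametric_test` is hypothetical. It is tried here on the diastolic blood pressure example from earlier in the chapter.

```python
import math

def parametric_test(estimate, hypothesised, se):
    """Return (number of standard errors, two-sided p-value)."""
    # Steps 1 and 2: difference from the hypothesised value, in standard errors.
    n_se = (estimate - hypothesised) / se
    # Step 3: two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(n_se) / math.sqrt(2))
    return n_se, p_value

# Boys with recurrent infections: mean 62.5 mmHg, se 1.43, hypothesised 58.5.
n_se, p_value = parametric_test(62.5, 58.5, 1.43)
print(round(n_se, 3), round(p_value, 3))  # 2.797 0.005
```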
i) One sample t-test
Ref: Elsherif et al. Indicators of a more complicated clinical course for pediatric patients with retropharyngeal abscess. International Journal of Pediatric Otorhinolaryngology, 2010: 74; 198-201.
15 patients with a complicated clinical course (CCC) had an average hospital duration of 7.6 days (sd 4.4 days), standard error = 4.4/√15 = 1.14 days.
Suppose we want to compare with a standard average stay of 4 days (available from other, national, data). The difference is 3.6 days and this is 3.6/1.14 = 3.16 standard errors, which (referring to the table on page 83) gives p<0.002, so their average stay is significantly longer.
ii) Two sample (unpaired) t-test
Ref: As (i)
Patients with a complicated clinical course (CCC) were compared to those with a smooth clinical course (SCC). The 115 with SCC had an average hospital duration of 5.4 days (sd 2.9) compared to an average 7.6 days (sd 4.4) for the 15 with CCC. The standard error for the difference of 2.2 days was calculated to be 0.85.
2.2/0.85 = 2.588 which means that p is approximately 0.01 (i.e., the chance of observing a difference in average stay that large if there is no difference in the populations from which these groups were taken is about 1 in 100).
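The text does not say how the standard error of 0.85 was obtained. The sketch below assumes the pooled (equal variance) formula for the standard error of a difference between two means, which reproduces the quoted value to two decimal places; treat the formula choice and the variable names as assumptions rather than the authors' stated method.

```python
import math

n_scc, mean_scc, sd_scc = 115, 5.4, 2.9   # smooth clinical course
n_ccc, mean_ccc, sd_ccc = 15, 7.6, 4.4    # complicated clinical course

# Pooled variance, weighting each group's variance by its degrees of freedom.
pooled_var = ((n_scc - 1) * sd_scc**2 + (n_ccc - 1) * sd_ccc**2) / (n_scc + n_ccc - 2)

# Standard error of the difference between the two means.
se_diff = math.sqrt(pooled_var * (1 / n_scc + 1 / n_ccc))

difference = mean_ccc - mean_scc
n_se = difference / se_diff
p_value = math.erfc(abs(n_se) / math.sqrt(2))  # two-sided, normal table

print(round(se_diff, 2), round(n_se, 2), round(p_value, 2))
```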
iii) One sample test of a proportion
Ref: Dai et al. Time trends in oral clefts in Chinese newborns: data from the Chinese national birth defects monitoring network. Birth Defects Research (Part A), 2010; 88; 41-47.
Of 6961 non-syndromic births between 1996 and 2005 with some form of clefting, 976 had a cleft palate alone (i.e., no cleft lip). This is 0.14 (95% ci (0.132, 0.148)). Hence the proportion seen is significantly different at the 5% level to all values outside this interval. For example, if we tested whether the proportion seen in the sample was compatible with a population proportion of 0.25, we would obtain a p-value < 0.05. Testing against a population proportion of 0.145 would give p>0.05.
The exact p-values obtained can be found by considering the differences of the sample proportion (0.14) from these hypothesized values in terms of standard errors and referring to normal tables. For the hypothesized value 0.25, this is (0.25-0.14)/0.004 = 27.5 se, giving p<0.0000005 (from normal table available on web link).
For the hypothesized value 0.145, this is 0.005/0.004 = 1.25 se and p=0.2113.
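These figures can be checked as follows. The sketch assumes the usual standard error of a proportion, sqrt(p(1 - p)/n), and then follows the text in using the rounded values 0.14 and 0.004 for the tests; without rounding, the numbers of standard errors would come out slightly different. The helper name `two_sided_p` is hypothetical.

```python
import math

def two_sided_p(n_se):
    """Two-sided p-value from the standard normal distribution."""
    return math.erfc(abs(n_se) / math.sqrt(2))

cases, births = 976, 6961
proportion = cases / births
se = math.sqrt(proportion * (1 - proportion) / births)
print(round(proportion, 2), round(se, 3))  # 0.14 0.004

# Rounded values, as used in the text.
p_hat, se_rounded = 0.14, 0.004
n_se_025 = (0.25 - p_hat) / se_rounded
n_se_0145 = (0.145 - p_hat) / se_rounded

p_025 = two_sided_p(n_se_025)    # far below 0.0000005
p_0145 = two_sided_p(n_se_0145)
print(round(p_0145, 4))  # 0.2113
```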
iv) Chi-square test (χ2) for proportions
Ref: As (iii)
Of the syndromic births, a higher proportion (279/1172 = 0.238) were cleft palate only.
The difference in proportions of 0.098 (= 0.238 - 0.140) has standard error 0.018. The difference is (0.098 - 0)/0.018 = 5.444 standard errors from zero, giving p=0.0000000521, so it is significantly different from zero.
v) One sample test of percentage
Ref: As (iii)
14% (95% ci (13.2, 14.8%)) of non-syndromic births between 1996 and 2005 with some form of clefting had a cleft palate alone (i.e., no cleft lip). Hence the percentage seen is significantly different at the 5% level to all values outside this interval. For example, if we tested whether the percentage seen in the sample was compatible with a population value of 25%, we would obtain a p-value < 0.05. Testing against a population value of 14.5% would give p>0.05.
The exact p-values obtained can be found by considering the differences between the 14% found in the sample and these hypothesized values in terms of standard errors and referring to normal tables. For the hypothesized value 25%, this is (25-14)/0.4 = 27.5 se, giving p<0.0000005.
For the hypothesized value 14.5%, this is 0.5/0.4 = 1.25 se and the exact p-value is 0.2113.
vi) Chi-square test (χ2) for percentages
Ref: As (iii)
Of the syndromic births 23.8% were cleft palate only. The difference in percentages of 9.8% has standard error 1.8%.
9.8/1.8 = 5.444, giving p=0.0000000521.
For estimates vii-ix, the estimate is transformed appropriately and the standard error calculated on the transformed scale. To perform significance tests, we need to calculate the number of standard errors difference between the actual and hypothesized values (on the transformed scale) to obtain a p-value from the table of the normal distribution (page 83 in these notes or using the Excel spreadsheet).
vii) Relative Risk
Ref: Kiani et al. Prevention of soccer-related knee injuries in teenaged girls. Archives of Internal Medicine, 2010; 170(1): 43-49.
Players given a training program of exercises designed to reduce knee injury had a relative risk of knee injury compared to controls of 0.226.
Ln(0.226) = -1.49 and the standard error of logged RR = 0.622
A relative risk (RR) of 1 means no difference between groups. Ln(1) = 0 and hence testing a hypothesized RR of 1 is equivalent to testing a ln(RR) of zero.
The observed ln(RR) of -1.49 is 1.49/0.622 = 2.395 standard errors away from the hypothesized value of zero. This gives p=0.01662 (which ties in with the fact that the 95% ci (0.07, 0.76) does not contain 1, yet the 99% ci (0.0455, 1.12) does, i.e. 0.01<p<0.05).
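The log-scale test and the back-transformed intervals can be sketched in code. The relative risk and its log-scale standard error are taken from the text; everything else (variable names, use of the normal distribution) is illustrative. The code keeps ln(0.226) unrounded, so the p-value differs slightly from the 0.01662 obtained with -1.49.

```python
import math

rr = 0.226
se_log_rr = 0.622

log_rr = math.log(rr)                 # about -1.49
n_se = (log_rr - math.log(1)) / se_log_rr   # ln(1) = 0 is the null value
p_value = math.erfc(abs(n_se) / math.sqrt(2))

# Confidence limits are formed on the log scale, then exponentiated.
ci95 = (math.exp(log_rr - 1.96 * se_log_rr), math.exp(log_rr + 1.96 * se_log_rr))
ci99 = (math.exp(log_rr - 2.58 * se_log_rr), math.exp(log_rr + 2.58 * se_log_rr))

print(round(p_value, 3))
print(ci95[1] < 1, ci99[0] < 1 < ci99[1])  # True True
```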
viii) Odds Ratio
Ref: Kabir et al. Active smoking and second-hand-smoke exposure at home among Irish children, 1995-2007; Archives of Disease in Childhood; 2010: 95, 42-45.
The odds ratio comparing smoking prevalence between 1995 and 1998 was 0.89. Ln(0.89) = -0.12, standard error 0.1146.
Comparing to an odds ratio of 1 (i.e., no difference), ln(1)=0:
0.12/0.1146 = 1.047, p=0.295 i.e., no significant difference.
ix) Correlation Coefficient
Ref: Oliviero et al. Effects of long-term L-thyroxine treatment on endothelial function and arterial distensibility in young adults with congenital hypothyroidism. European Journal of Endocrinology (2010), 162:289-294.
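The correlation between Flow Mediated Dilation (FMD) and pubertal mean TSH in 32 patients with congenital hypothyroidism was -0.81. Applying Fisher's z transformation, 0.5 × ln((1 + r)/(1 - r)), gives approximately -1.125 on the transformed scale, with standard error 1/√(32 - 3) = 0.186.
A hypothesized correlation of zero is also zero on the transformed scale. The observed -1.125 is therefore 1.125/0.186 = 6.05 standard errors away from the hypothesized value and p=0.00000000145 (obtained using the normal table Excel sheet).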
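The figures -1.125 and 0.186 are consistent with Fisher's z transformation of the correlation coefficient, with standard error 1/sqrt(n - 3). The sketch below works under that assumption; it is an illustration rather than a statement of the authors' exact method. Because the code does not round, it gives about -1.127 and slightly more than 6.05 standard errors.

```python
import math

r = -0.81
n = 32

# Fisher's z transformation: 0.5 * ln((1 + r) / (1 - r)), i.e. atanh(r).
z_r = math.atanh(r)
se = 1 / math.sqrt(n - 3)

# A null correlation of 0 transforms to 0 on the z scale.
n_se = (z_r - 0) / se
p_value = math.erfc(abs(n_se) / math.sqrt(2))

print(round(z_r, 2), round(se, 3), round(n_se, 2))
```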
- Alternative Significance Tests
We have seen how parametric significance tests can be used when standard errors can be validly calculated from the sample data. In chapter 9 we consider alternatives when these cannot validly be used (i.e., when samples are too small, proportions too extreme and/or distributions skew). These alternatives are similar in that a null hypothesis is proposed and the extent to which the observed data supports this is quantified in terms of a p-value.
There are many general rules which apply to all significance tests and the interpretation of confidence intervals and p-values. These are discussed in the remainder of this chapter.
Validity of the Tests
It should be borne in mind that the calculation of standard errors was only valid under certain criteria which were given in chapter 5. Recall:
For means:
- Sample(s) >20
- Sample(s) Approx. Normally Distributed
- Standard Deviations/Variances Approx. Equal
For percentages/proportions:
- Sample(s) >20
- Percentage(s)/Proportion(s) Non-Extreme
Using data transformation as a means of inducing normality was introduced in chapter 3.
Sometimes it is possible to transform the data prior to analysis so that it is suitable for the one and two-sample t-tests described.
There are a variety of transformations that can be tried, although it was previously illustrated that some data (e.g. variables with a J-shaped distribution) can never be normalised.
- When analysing paired differences, for example using a paired t-test, it should be remembered that it is the DIFFERENCES that are being analysed. It is the paired differences that may require transformation to meet the normality requirements and the shape of the distributions of the values that were used to form those differences become irrelevant. The differences may be approximately normally distributed even though the individual distributions are distinctly skew.
- When two or more unpaired groups of values are to be compared, if they require transformation prior to analysis the SAME TRANSFORMATION SHOULD BE USED FOR EACH GROUP. Transformation of the groups may be used to either remove skewness or to make the variances approximately equal.
(1) Ref: Everall et al, Neuronal loss in the frontal cortex in HIV infection, The Lancet, 1993; Vol 337.
This study compared the number of neurons between individuals with HIV and a non-infected control group. The mean (sd) in the HIV and control groups were 307 (46) and 499 (113) respectively. The neuron counts in the control group had a substantially greater standard deviation than those in the HIV group.
After log transformation the two groups had more even spread, with transformed mean (sd) of 2.50 (0.051) and 2.69 (0.094). Provided the distribution post-transformation was approximately normal it would be valid to use a 2-sample test to compare the transformed means, but the test would have been invalid prior to transformation.
(2) Ref: Nash M, Shah V, Dillon MJ. Unpublished data.
The aim of the study was to compare serum levels of anti endothelial cell antibodies (AECA) from patients with Kawasaki disease (KD) and two groups of control patients, one febrile and the other afebrile. These measurements are lognormally distributed. Log transforming the data will normalise all 3 groups. Log transformations were made and 2-sample t-tests were performed on the transformed values to compare KD with the febrile and afebrile controls. The plot below shows the untransformed data and p-values from t-tests performed on the transformed data.
Transformations are used to normalise data for the purposes of statistical analysis only.
After the data has been suitably transformed, the appropriate analyses can be performed on the transformed data and the appropriate p-value calculated.
If the transformation has succeeded in normalising the data then comparing the mean(s) on the transformed scale is equivalent to comparing the median(s) on the untransformed scale. This is because transformation does not change the relative ordering of the measurements.
Any hypothesised population values should be converted to the transformed scale before testing. If confidence intervals are required, then these should be calculated on the transformed data and the limits (or edges of the interval) transformed back to the original scale to be reported.
Note that if 2 or more groups are transformed prior to comparison, the confidence interval for the difference on the transformed scale cannot usually be easily interpreted in terms of the original scale. One exception is when log transformations are used; in this case the confidence limits when back transformed are for the ratio of one group to the other.
The serum bilirubin values obtained from 216 patients with primary biliary cirrhosis were shown previously to be distinctly upwardly skew. The logged values are approximately normally distributed:
The mean of the logged data is 3.56, standard deviation 0.976.
Transforming the data does not change the order of the values and the transformed data is approximately normally distributed, hence the antilog of the mean on the logged scale will be approximately equal to the median of the untransformed data.
Antilog (3.56) = 35.16 which is approximately the median of the untransformed values.
A 95% confidence interval for the mean = (3.56 ± 1.96(0.066)) = (3.43, 3.69)
Converting this interval back to the original units (or anti-logging i.e., exponentiating) gives an interval (30.88, 40.04).
Suppose we want to test whether these patients have average levels in excess of the upper normal limit of serum bilirubin, which is known to be 12. The null hypothesis is that the average serum bilirubin level in primary biliary cirrhosis patients is 12. The values are logged so that a t-test can be used. The null hypothesis value of 12 also needs to be transformed [Log(12) = 2.48]. The observed mean of the logged data (3.56) is (3.56 - 2.48)/0.066 = 16.36 standard errors away from the hypothesised value of 2.48.
Hence p<0.002 and there is strong evidence that these patients have raised serum bilirubin levels.
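The serum bilirubin example can be retraced in code. This sketch assumes natural logarithms (which matches Antilog(3.56) = 35.16 in the text) and a normal reference distribution; variable names are illustrative. Intermediate values are not rounded, so the results differ from the text in the later decimal places (for example, about 16.2 rather than 16.36 standard errors).

```python
import math

n = 216
mean_log, sd_log = 3.56, 0.976   # summary of the logged bilirubin values

se_log = sd_log / math.sqrt(n)   # about 0.066

# The antilog of the mean of the logs approximates the median on the original scale.
approx_median = math.exp(mean_log)

# 95% confidence interval on the log scale, then back-transformed.
ci_log = (mean_log - 1.96 * se_log, mean_log + 1.96 * se_log)
ci_original = (math.exp(ci_log[0]), math.exp(ci_log[1]))

# The hypothesised value must be moved to the log scale before testing.
n_se = (mean_log - math.log(12)) / se_log
p_value = math.erfc(abs(n_se) / math.sqrt(2))

print(round(approx_median, 2), [round(v, 1) for v in ci_original], p_value < 0.002)
```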
Next, suppose we wish to test whether there are significant differences between the serum bilirubin values found in this group and a group of 200 normal individuals. First we would need to take logs of all serum bilirubin values from the normal individuals.
Suppose these logged values have a mean of 2.50 and standard deviation 0.82. Antilog (2.50) = 12.182 and this is approximately the median of the untransformed values.
Difference between patients and controls on log scale = 3.56 - 2.50 = 1.06
95% CI for the difference on the log scale
= (1.06 ± 1.96(0.0884))
= (1.06 ± 0.17)
= (0.89, 1.23)
95% CI for the difference on the untransformed scale = (2.43, 3.42)
i.e. we are 95% confident that the patients' values are on average between 2.43 and 3.42 times higher than those seen in normal individuals.
Compare this with the difference observed in medians. For the patient group the median was approximately 35.16 and this is 2.886 (= 35.16/12.182) times larger than the median for normal individuals.
Note that antilog(1.06) = 2.886 i.e. the antilog of the difference in sample means on the logged scale is equivalent to the ratio of the sample medians of the untransformed data samples.
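The back-transformation of a difference on the log scale into a ratio can be sketched as follows. The difference (1.06) and its standard error (0.0884) are taken from the text; natural logarithms are assumed and the variable names are illustrative. Because the limits are not rounded before exponentiating, the upper limit comes out nearer 3.43 than the 3.42 quoted above.

```python
import math

diff_log = 3.56 - 2.50     # patients minus normal individuals, log scale
se_diff = 0.0884           # standard error of the difference, from the text

# Confidence interval for the difference on the log scale.
ci_log = (diff_log - 1.96 * se_diff, diff_log + 1.96 * se_diff)

# Exponentiating a difference of logs gives a ratio on the original scale.
ratio = math.exp(diff_log)
ci_ratio = (math.exp(ci_log[0]), math.exp(ci_log[1]))

print(round(ratio, 3))  # 2.886
print(round(ci_ratio[0], 2), round(ci_ratio[1], 2))
```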