In the previous chapter we looked at the calculation of p-values, significance testing, and the relationship between confidence intervals and significance tests. The tests in chapter 6 focused on independent groups and samples.
This chapter will look at paired data, where there is non-independence between the groups or measurements that will be compared, and how the same principles of significance testing are applied to paired data.
- Pairing in Data
Sometimes there is a natural pairing to the data and it is important not to lose this in the analysis. Usually the pairing has been imposed as part of the study design and this has been done for the purpose of avoiding confounding.
The pairing may be:
(i) Between individuals - This may occur in observational or experimental studies. Pairs of individuals who are similar with respect to potential confounders are selected.
For example, we may choose a control child of the same age, sex, gestational age and ethnicity to compare their lung function and IQ to children with some disorder. Because of the matching we can be sure that any differences between the children with disease and healthy controls cannot be due to any difference in ages, sexes, gestational age or ethnicity between the groups.
Alternatively, we may recruit pairs of individuals of similar disease severity and medical history to randomly allocate to 2 different treatments - one of each pair will receive each treatment.
NOTE that the groups in either scenario are pair matched. If the groups had similar distribution of, for example, age and sex, but were not specifically pair matched, then this would not be paired or matched data and this section would not apply.
(ii) Within person - measurements are made on the same person or item under different conditions, usually whilst undergoing different treatments. If the order of treatment is randomized, this is commonly known as a CROSSOVER TRIAL.
These trials may be a very efficient way of determining whether a treatment is more effective than something else (a different treatment or a placebo), since they are much less prone to confounding than if different groups of individuals had been given the different treatments.
However, there are some points to note:
- The order of treatments MUST be randomized. It might be tempting to do a 'before' and 'after' type of study whereby everyone has nothing (or standard treatment) initially, is measured, and then crosses over to the test treatment but this will not give useful results.
- We will not know that any change is due to treatment, as the patients may have changed or improved anyway, particularly as we often recruit individuals into trials when they are at a particularly bad point and are hence likely to improve if left alone. (Think of your own happiness: if we were to study you at a point where you are feeling a bit down and then look again a week later, your mood is likely to have improved regardless of any intervention. This phenomenon, which arises because many factors and disease measurements fluctuate irrespective of treatment, is known as regression to the mean; i.e., if we recruit someone at a low point they are likely to improve, and if we recruit them at a high point they are likely to deteriorate.)
- There should be a WASHOUT period between the treatment phases. This may be short or long depending on how long needs to be left between the end of one treatment and the start of the other to be fairly confident that there is no CARRYOVER effect of the treatment from the first study period.
- They can only be used where the underlying condition is reasonably constant and the treatments are only expected to provide short term relief. If the first treatment cured or permanently improved the condition, then there would be nothing to trial the 2nd treatment on. Similarly, if the disease is progressing and the outcome in the absence of treatment is not constant, then there will be a difference in the conditions on which the two treatments are trialled.
- There should be no ORDER EFFECT, whereby the order in which the treatments are given affects the outcomes.
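The regression to the mean point in the list above can be illustrated with a small simulation. This is only a sketch under assumed, hypothetical values (a stable underlying score of 50, fluctuation with a standard deviation of 10, and recruitment of anyone measured below 40); none of these numbers come from a real study, and no treatment is given between the two measurements.

```python
import random
from statistics import mean

random.seed(1)  # fixed seed so the run is repeatable

TRUE_LEVEL = 50.0   # hypothetical stable underlying score (assumption)
FLUCTUATION = 10.0  # hypothetical SD of random fluctuation (assumption)
THRESHOLD = 40.0    # recruit only people measured at a "low point" (assumption)

# Two measurements per person, a week apart, with no intervention in between.
first = [random.gauss(TRUE_LEVEL, FLUCTUATION) for _ in range(10_000)]
second = [random.gauss(TRUE_LEVEL, FLUCTUATION) for _ in range(10_000)]

# Keep only those whose first measurement fell below the recruitment threshold.
recruited = [(a, b) for a, b in zip(first, second) if a < THRESHOLD]

mean_before = mean(a for a, _ in recruited)
mean_after = mean(b for _, b in recruited)

print(f"recruited: {len(recruited)}")
print(f"mean at recruitment: {mean_before:.1f}")
print(f"mean one week later: {mean_after:.1f}")
```

The recruited group appears to "improve" even though nothing was done to them, which is why a simple before-and-after comparison cannot isolate a treatment effect.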
PAIRED OR UNPAIRED?
It is not uncommon to find paired data analysed using unpaired tests. If the pairing has been successful in isolating confounders (which is the purpose of using matching in data recruitment) then not retaining the pairs in the analysis will be wasteful of the data.
Confidence intervals may be wider than necessary, leading to inconclusive results. If there is a pairing of the data then a paired test is usually appropriate.
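To see why ignoring the pairing can widen the confidence interval, the sketch below compares the standard error from a paired analysis with the one from an unpaired analysis of the same numbers. The six pairs are hypothetical values chosen for illustration (individuals differ a lot from each other, but the within-pair change is consistent); they are not data from any study.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical outcomes for 6 matched pairs (illustrative values only).
treatment_a = [10.0, 14.0, 18.0, 22.0, 26.0, 30.0]
treatment_b = [11.0, 15.5, 19.0, 23.5, 27.0, 31.5]
n = len(treatment_a)

# Paired analysis: work on the within-pair differences.
differences = [b - a for a, b in zip(treatment_a, treatment_b)]
mean_difference = mean(differences)
se_paired = stdev(differences) / sqrt(n)

# Unpaired analysis: treats the two columns as independent samples,
# so the large between-individual variability enters the standard error.
se_unpaired = sqrt(variance(treatment_a) / n + variance(treatment_b) / n)

print(f"mean difference (B - A): {mean_difference:.2f}")
print(f"standard error, paired:   {se_paired:.3f}")
print(f"standard error, unpaired: {se_unpaired:.3f}")
```

Both analyses estimate the same mean difference; only the standard error, and therefore the width of the confidence interval, changes.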
The appropriate test to use will depend on whether the outcome is numeric or categorical.
- Paired Numeric Data
If numeric outcomes are assessed within matched pairs (e.g., the same individuals measured on different treatments in a crossover trial, or sex and age matched pairs of individuals) then the difference within pairs can be calculated to quantify the difference between groups (of treatments or disease types) whilst retaining the matching.
Having calculated these differences, the analysis can proceed on these differences without reference back to the original data values for the individuals. The within pair differences contain the information needed to determine whether treatment is effective, or disease is influential.
If the null hypothesis (no difference) is true then the within pair differences will be normally distributed around zero.
The average within pair difference can be expected to differ from zero by an amount dependent on the standard error. The standard error will depend on the sample size and the variability of the within pair differences.
Ref: Lloyd, Kirk, Dean and Kyle, Effect of percutaneous nephrolithotomy on thermoregulation, British Journal of Urology, 1992; 69: 132-36.
The aim of the study is to see whether there is a shift in temperature from start to end. This is the same as asking whether the within pair difference is on average zero. (If there is no overall shift then patients' temperatures will fluctuate randomly: some will go up and some will go down, and on average the difference will be zero.)
This can be tested using a paired t-test.
- Paired t-test
A paired t-test can be used to determine whether the within pair difference obtained is larger than would be expected to have occurred by chance. The probability that it is a chance occurrence is given by the p-value obtained from the test. The procedure for the paired t-test is identical to the 1-sample t-test apart from the fact that it is performed on paired differences rather than single observations.
For the sample of 12 patients in the table, the mean difference is -1.53, with standard deviation 1.71. If there is no change in temperature on average, then the means of samples of size 12 will be normally distributed around a mean of zero, with a standard error of 1.71/√12 = 0.49.
The observed mean is (-1.53 - 0)/0.49 = -3.12, i.e., 3.12 standard errors below the hypothesised mean of zero.
The normal distribution table shows that the observed mean would occur less than twice in 1000 samples if there were no change on average (p < 0.002).
A more precise p-value (0.0018) can be obtained from the normal table spreadsheet.
A 95% confidence interval for the average change is given by:
(-1.53 ± 1.96(0.49)) = (-1.53 ± 0.96) = (-2.49, -0.57°C)
I.e., the average fall in temperature from start to end of operation is between 0.57 and 2.49°C.
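The arithmetic in this worked example can be checked with a few lines of Python. This sketch uses only the summary figures quoted above (n = 12, mean -1.53, standard deviation 1.71) and follows the text in rounding the standard error to 0.49 and using the normal distribution; the variable names are illustrative.

```python
from math import sqrt
from statistics import NormalDist

n = 12
mean_diff = -1.53  # mean within-pair difference quoted in the text
sd_diff = 1.71     # standard deviation of the within-pair differences

# Standard error of the mean difference, rounded to 2 decimals as in the text.
se = round(sd_diff / sqrt(n), 2)

# Distance of the observed mean from the hypothesised mean of zero.
z = (mean_diff - 0) / se

# Two-sided p-value from the normal distribution.
p_two_sided = 2 * (1 - NormalDist().cdf(abs(z)))

# 95% confidence interval for the average change.
ci_lower = mean_diff - 1.96 * se
ci_upper = mean_diff + 1.96 * se

print(f"SE = {se}, z = {z:.2f}, p = {p_two_sided:.4f}")
print(f"95% CI: ({ci_lower:.2f}, {ci_upper:.2f})")
```

With only 12 pairs, software implementing the paired t-test would normally refer the statistic to a t distribution on 11 degrees of freedom rather than the normal distribution, which gives a somewhat larger p-value and a wider interval.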
It may be tempting to attribute this fall in temperature to the effects of surgery. There is, however, no control group (i.e. a group not undergoing surgery) with which to compare the results. It may be argued that the patients have acted as their own controls since their temperatures were measured before surgery, but there may be other factors that have caused the observed reduction. Perhaps it was simply colder in the operating theatre than in the ward from which the patients came.
Without a proper control group it is impossible to make any statements about the effect of surgery.
- Complete Analysis of a Crossover Trial
Seventeen asthmatic patients are randomly allocated to receive one of two treatments, A or B, first. They have their lung function measured after this treatment and, after a washout period, receive the other treatment followed by a final lung function measurement. A high measurement represents better lung function.
Ref: B. Jones and M.G. Kenward, Design and Analysis of Cross-over Trials, Chapman and Hall, 1995.
Plotting the data
A scatterplot can be used to show each individual's measurements on the two treatments. Different symbols can be used to indicate the order of treatment for each individual and the line of equality superimposed for ease of interpretation. This chart is shown below.
The measurements for the 8 individuals in group 1 (A first, B second) are shown as solid circles and the measurements for the 9 individuals in group 2 (B first, A second) are shown as stars.
The line of equality (measurement after treatment A = measurement after treatment B) is superimposed on the plot. For most of the asthmatics tested their lung function after receiving B was higher than after they received treatment A (points lie above the line of equality).
Plotting the means of each group over time is also informative.
The mean of group 1 after period 1 (treatment A) is 1.57, and after treatment B is 1.69. The mean of group 2 after period 1 (treatment B) is 2.34, and after treatment A is 1.95.
Is the carry-over from A to B (group 1 2nd period) the same as from B to A (group 2 2nd period)?
The null hypothesis being tested is that the carry-over effects are equal between groups 1 (A first) and group 2 (B first).
This is tested by comparing the sum of the values over both treatment periods between groups 1 and 2; i.e., the 8 totals from group 1 (1.28+1.33, 1.60+2.21... 2.41+2.79) with the 9 totals from the individuals in group 2 (3.06+1.38, 2.68+2.10... 1.16+1.25).
The average sum in group 1 is 3.2625 and the average sum in group 2 is 4.2867, giving a difference of -1.024.
A 2-sample t-test yields a p-value of 0.13 and the conclusion is that the carry-over effects are not significantly different.
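The carry-over comparison can be sketched as follows. The full trial data are not reproduced in this text, so the measurements below are hypothetical placeholders, and `pooled_t` is an illustrative helper that assumes the equal-variance (pooled) form of the 2-sample t-test; the text does not say which form was used.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(x, y):
    """Two-sample t statistic with pooled variance, plus its degrees of freedom."""
    nx, ny = len(x), len(y)
    df = nx + ny - 2
    pooled_var = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / df
    se = sqrt(pooled_var * (1 / nx + 1 / ny))
    return (mean(x) - mean(y)) / se, df

# Hypothetical (period 1, period 2) measurements; NOT the trial data.
group1 = [(1.2, 1.4), (1.6, 2.0), (2.1, 2.3), (1.5, 1.6)]  # A first, then B
group2 = [(2.9, 1.9), (2.4, 2.2), (1.8, 1.7), (2.6, 2.0)]  # B first, then A

# Carry-over test: compare each individual's total over both periods.
totals1 = [p1 + p2 for p1, p2 in group1]
totals2 = [p1 + p2 for p1, p2 in group2]

t_carry, df = pooled_t(totals1, totals2)
print(f"difference in mean totals: {mean(totals1) - mean(totals2):.3f}")
print(f"t = {t_carry:.2f} on {df} degrees of freedom")
```

The p-value is then read from a t distribution with the stated degrees of freedom, using tables or statistical software.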
Note: We can only test the treatment effect if it can be assumed that the carryover effects are equal. Hence it is necessary to perform the test above first. If a significant difference is found in carry-over, then this subsequent test will be invalid.
The null hypothesis being tested is that the treatment effects are equal. This is tested by comparing the difference between period 1 and period 2 for the 2 groups. For group 1, the differences will be for after time spent on treatment A minus those after time spent on treatment B (1.28-1.33, 1.60-2.21... 2.41-2.79). For group 2, differences will be for after time spent on treatment B minus those after time spent on treatment A (3.06-1.38, 2.68-2.10... 1.16-1.25).
The average difference for the 8 individuals in group 1 (A-B) is -0.1175, and for the 9 individuals in group 2 (B-A) the average difference is 0.3956. A 2 sample t-test of the difference in these differences yields a p-value of 0.047.
To estimate the size of the treatment effect we take half of the difference between the average differences for each group, i.e., 0.5 x (-0.1175-0.3956) = 0.5 x -0.5131 = -0.2565 (since (A-B)-(B-A) = 2A-2B = 2 (A-B) and we want half of this).
The output from the 2 sample t-test gives a 95% confidence interval for the difference of -0.5131 of (-1.019, -0.007), and hence a 95% confidence interval for the estimated difference in treatment effects (A-B) of -0.2565 is (-0.5095, -0.0035).
We conclude that B has a significantly better effect on lung function (on average) than treatment A.
It may be of interest to formally test the period effect (i.e., is there a change over time irrespective of treatment). To do this we need to assume that there is no significant treatment effect.
The null hypothesis being tested is that changes between treatments are equal irrespective of time. This is tested by comparing the difference between treatment A and treatment B for the 2 groups. For group 1, the differences will be for after time period 1 minus those after time period 2 (1.28-1.33, 1.60-2.21... 2.41-2.79). For group 2, differences will be for after time period 2 minus those after time period 1 (1.38-3.06, 2.10-2.68... 1.25-1.16).
The average difference for the 8 individuals in group 1 is -0.1175, and for the 9 individuals in group 2 the average difference is -0.3956. A 2 sample t-test of the difference in these differences yields a p-value of 0.26.
To estimate the size of the period effect we take half of the difference between the average differences for each group i.e., 0.5 x (-0.1175-(-0.3956)) = 0.5 x (-0.1175+0.3956) = 0.5 x 0.2781 = 0.139 (since (P1-P2)-(P2-P1) = 2P1-2P2 = 2 (P1-P2) and we want half of this).
The output from the 2 sample t-test gives a 95% confidence interval for the difference 0.2781 of (-0.228, 0.784) and hence a 95% confidence interval for the estimated period effect (P1-P2) of 0.139 is (-0.114, 0.392).
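The treatment and period estimates use the same within-individual differences; only the sign attached to group 2 changes. The sketch below reproduces both estimates from the group means and confidence limits quoted in the text (the raw data are not needed). The helper `halve` and the variable names are illustrative.

```python
# Mean within-individual differences (period 1 minus period 2) from the text.
g1_p1_minus_p2 = -0.1175  # group 1 (A first): period 1 - period 2 = A - B
g2_p1_minus_p2 = 0.3956   # group 2 (B first): period 1 - period 2 = B - A

def halve(contrast, ci):
    """Halve a difference of differences together with its confidence limits."""
    return contrast / 2, (ci[0] / 2, ci[1] / 2)

# Treatment effect: (A - B) in group 1 minus (B - A) in group 2 = 2(A - B).
treatment_contrast = g1_p1_minus_p2 - g2_p1_minus_p2     # about -0.5131
treatment_effect, treatment_ci = halve(treatment_contrast, (-1.019, -0.007))

# Period effect: flip group 2 so both groups are A - B, then contrast them:
# (P1 - P2) in group 1 minus (P2 - P1) in group 2 = 2(P1 - P2).
period_contrast = g1_p1_minus_p2 - (-g2_p1_minus_p2)     # about 0.2781
period_effect, period_ci = halve(period_contrast, (-0.228, 0.784))

print(f"treatment effect (A - B): {treatment_effect:.4f}, 95% CI {treatment_ci}")
print(f"period effect (P1 - P2):  {period_effect:.4f}, 95% CI {period_ci}")
```

Because flipping the sign of one group's differences leaves their variability unchanged, the two confidence intervals have the same width; only their centres differ.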
- Paired Categorical Data
The first stage in the analysis of paired numeric data was to calculate the within pair differences which were then the basis for analysis. With binary data it is not possible to calculate a difference for each pair of observations.
If one member of the pair has a blood pressure of 80 and their matched pair has a blood pressure of 84, then the within pair difference can be calculated to be 4mmHg.
BUT if one member has a positive reaction and the other has a negative reaction, it does not make sense to have a 'difference' = positive - negative.
It is important to display the paired binary data in a way that retains the pairing.
The following table shows the results of an experiment in which specimens of sputum from 50 subjects were each cultured on two different media TM01 and TM02 in order to compare the media in their ability to detect tubercle bacilli:
Since we are interested in comparing the proportions of specimens found positive on the two media, we might summarize these results in a table:
From this table we can see that 64% of specimens were found positive on medium TM01 and 44% positive on TM02. However the paired nature of the data is 'lost' in this table. From this table it is not possible to determine how many of the 22 positive on TM02 were also positive on TM01.
A better display of the data is given below:
This table gives the same information as the previous table (32 positive on TM01, 22 positive on TM02) but also details the correspondence between pairs of outcomes on the same subjects.
Of the specimens, 36 gave the same result on both media. The 14 that disagreed give information as to whether one medium was more likely to detect tubercle bacilli than the other.
If the two media were equally likely to detect the bacilli we would expect to find about half of the 14 disagreeing positive on TM01 and negative on TM02 and the other half negative on TM01 and positive on TM02; i.e., we would expect to see 7 in each of the off diagonal boxes of the above table. The numbers we observe in this sample are 12 and 2 respectively.
This can be tested using McNemar's test.
- McNemar's Test
McNemar's test is a test of the difference in the proportion positive on the 2 media that takes into account the paired nature of the data. It considers how unlikely a difference of 10 (12-2) in the off diagonals would be if the two media were equally likely to detect (or not detect) the bacilli.
The null hypothesis therefore is that of those disagreeing (i.e., the 14 off diagonals in the table), 50% are expected to be positive on TM01 and negative on TM02 (and similarly the other 50% are expected to be negative on TM01 and positive on TM02).
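The text does not state which form of McNemar's test it has in mind, so the sketch below shows two common versions applied to the 12 and 2 discordant pairs: an exact version based on the binomial distribution with probability 0.5, and the large-sample chi-squared version (without continuity correction) on 1 degree of freedom. Variable names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

pos_neg = 12  # positive on TM01, negative on TM02
neg_pos = 2   # negative on TM01, positive on TM02
discordant = pos_neg + neg_pos

# Exact version: under the null hypothesis each discordant pair is equally
# likely to fall in either off diagonal cell, i.e. Binomial(discordant, 0.5).
k = max(pos_neg, neg_pos)
upper_tail = sum(comb(discordant, i) for i in range(k, discordant + 1)) / 2 ** discordant
p_exact = min(1.0, 2 * upper_tail)  # two-sided

# Large-sample version: chi-squared statistic on 1 degree of freedom.
chi_sq = (pos_neg - neg_pos) ** 2 / discordant
# A chi-squared variable on 1 df is a squared standard normal, so the
# upper tail can be found from the normal distribution.
p_chi_sq = 2 * (1 - NormalDist().cdf(sqrt(chi_sq)))

print(f"exact two-sided p-value: {p_exact:.4f}")
print(f"chi-squared = {chi_sq:.2f}, p-value = {p_chi_sq:.4f}")
```

Only the 14 discordant pairs enter either calculation; the 36 pairs that agree carry no information about which medium detects more often.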
A confidence interval for the difference in proportions (or percentages) of those testing positive on the two media ((32-22)/50 = 0.2 or 20%) can be constructed using a standard error which takes into account:
- The total number of pairs observed
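- The number of pairs for which there is disagreement (dT = d₊₋ + d₋₊, where d₊₋ is the number of pairs positive on the first and negative on the second, and d₋₊ the number negative on the first and positive on the second)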
- The extent to which the disagreements are more likely to be in one direction than the other
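The formula for the standard error for the difference in these proportions, where n is the total number of pairs, is given by:
SE = (1/n) × √[dT − (d₊₋ − d₋₊)²/n]
In this example, this leads to the standard error:
SE = (1/50) × √[14 − (12 − 2)²/50] = (1/50) × √12 = 0.069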
A 95% confidence interval is given by (0.2 ± 1.96(0.069)) = (0.065, 0.335).
There is some evidence that there is a difference in the detection rate, with the percentage of specimens in which TM01 detects tubercle bacilli being between 6.5 and 33.5 percentage points higher than for TM02.
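The standard error and confidence interval above can be reproduced from the three quantities listed (the number of pairs, the number of disagreements, and their direction). The sketch assumes the usual large-sample standard error for a difference in paired proportions, which matches the value of 0.069 quoted in the text; variable names are illustrative.

```python
from math import sqrt

n_pairs = 50
pos_neg = 12  # positive on TM01, negative on TM02
neg_pos = 2   # negative on TM01, positive on TM02

# Difference in proportions positive; equals (32 - 22) / 50.
diff = (pos_neg - neg_pos) / n_pairs

# Standard error using the total disagreements and their imbalance.
d_total = pos_neg + neg_pos
se = sqrt(d_total - (pos_neg - neg_pos) ** 2 / n_pairs) / n_pairs

ci = (diff - 1.96 * se, diff + 1.96 * se)

print(f"difference = {diff:.2f}, SE = {se:.3f}")
print(f"95% CI: ({ci[0]:.3f}, {ci[1]:.3f})")
```

Carrying the unrounded standard error through gives limits that differ from the quoted interval only in the third decimal place.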
You can access an excel spreadsheet that will calculate standard errors and confidence intervals for paired binary proportions here:
- Paired Binary Data: Small Samples and Extreme Proportions
As previously discussed, there is an alternative method for constructing confidence intervals that can be used if the method given on the previous page is invalidated by small samples and/or extreme proportions. The method is detailed below.
For a 95% confidence interval for the difference, first calculate the limits for each sample separately as given on page 121, giving (l1, u1) and (l2, u2) as for the two sample unpaired case.
A 95% confidence interval for the difference is given by:
Applying this method to the sputum test data, we have:
W = 32 x 18 x 22 x 28 = 354,816
X = (20 x 16) - (12 x 2) = 296
Y = 296 - 50/2 = 271 (since X > n/2)
The proportion testing positive on Medium TM01 is 0.64 (= 32/50); following the calculations for a 95% confidence interval for a single proportion given on page 121:
The 95% confidence interval is from 0.502 to 0.759, hence l1 = 0.502 and u1 = 0.759.
The proportion testing positive on Medium TM02 is 0.44 (= 22/50); following the calculations for a 95% confidence interval for a single proportion given on page 121:
The 95% confidence interval is from 0.312 to 0.577, hence l2 = 0.312 and u2 = 0.577.
Hence a 95% confidence interval for the difference in paired proportions is given by (0.056, 0.329).
Note: this is similar to the interval (0.065, 0.335) found using the first method.
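The calculation for the sputum data can be sketched in Python. Two things here are assumptions, since the chapter's own formulas are not reproduced in this text: the single-proportion limits are taken to be Wilson score limits, and the limits are combined using the score-based method for paired proportions usually attributed to Newcombe. Under those assumptions the sketch reproduces the W, X, Y, l1, u1, l2, u2 and interval values quoted above. The function name `wilson` is illustrative.

```python
from math import sqrt

Z = 1.96  # multiplier for a 95% interval

def wilson(successes, n, z=Z):
    """Wilson score limits for a single proportion (assumed method)."""
    p = successes / n
    centre = (p + z * z / (2 * n)) / (1 + z * z / n)
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / (1 + z * z / n)
    return centre - half, centre + half

# Paired table for the sputum data: both positive, TM01+/TM02-, TM01-/TM02+, both negative.
both_pos, pos_neg, neg_pos, both_neg = 20, 12, 2, 16
n = both_pos + pos_neg + neg_pos + both_neg

pos1, pos2 = both_pos + pos_neg, both_pos + neg_pos  # 32 and 22
p1, p2 = pos1 / n, pos2 / n
l1, u1 = wilson(pos1, n)
l2, u2 = wilson(pos2, n)

# Correction for the correlation between the paired results.
W = pos1 * (n - pos1) * pos2 * (n - pos2)
X = both_pos * both_neg - pos_neg * neg_pos
if X > n / 2:
    Y = X - n / 2
elif X >= 0:
    Y = 0.0
else:
    Y = X
phi = Y / sqrt(W) if W > 0 else 0.0

D = p1 - p2
lower = D - sqrt((p1 - l1) ** 2 - 2 * phi * (p1 - l1) * (u2 - p2) + (u2 - p2) ** 2)
upper = D + sqrt((u1 - p1) ** 2 - 2 * phi * (u1 - p1) * (p2 - l2) + (p2 - l2) ** 2)

print(f"W = {W}, X = {X}, Y = {Y}")
print(f"95% CI for the difference: ({lower:.3f}, {upper:.3f})")
```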