UCL Great Ormond Street Institute of Child Health


Chapter 8: Nonparametrics

The significance tests shown in the previous chapters are known as parametric tests. There are certain assumptions that need to be met for the tests to be valid.

One of these assumptions is that the data follow a defined distribution, hence the name parametric (after the parameters in the distribution). Sometimes a transformation can be applied to the data so that the requirements are met. However, this will not always be possible (for example, where data is distributed in a j-shape or where 2 groups have different distributions) and then alternative methods of analysis must be used.

This chapter examines alternative nonparametric methods for testing data and how they work.

Comparing Dissimilar Groups

Normal vs. Diseased Patients

A control group (normal patients, mice, assays etc.) may be compared with a 'diseased' group (diseased patients, infected mice, contaminated assays etc.) after both groups have received some treatment or intervention.

It is not uncommon to find that the controls react to the treatment but that this is not a strong reaction, whereas some of the comparison group show a very strong reaction. Alternatively, the comparison may be between a group of controls and a diseased group some of whom have very 'strange' values as a result of the disease.

Consequently, the control group has values that are normally distributed and the diseased group is highly skew.

Some examples are shown here:

Ref: Jones D, Hopkinson N and Powell R, Autoantibodies in pet dogs owned by patients with systemic lupus erythematosus, The Lancet, 1992; 339: 1378-1380.

Dog Antibody Plot

Ref: Van de Graff E, Jansen H, Bakker M, Alberts C, Eeftinck Schattenkerk J and Out T, ELISA of complement C3a in bronchoalveolar lavage fluid, Journal of Immunological Methods, 1992; 47: 241-250.

Van de Gaff et al

Ref: Rodwell R, Taylor K, Tudehope D and Gray P, Capillary plasma elastase a-proteinase inhibitor in infected and non-infected neonates, Archives of Disease in Childhood, 1992; 67: 436-439.

Rodwell Capillary Plasma

For each of these 3 examples, it is not possible to find a transformation that will make all of the groups normally distributed. Any transformation that removes the upward skew of the diseased groups will leave the control group downwardly skew.

In situations where the data cannot be transformed to normality with approximately equal variability between groups, then the t-tests cannot validly be used. It should be noted that most statistical packages will not alert the user to the fact that their tests may be invalid. You should always ensure that the data fits the requirements of the test beforehand.

The alternative methods are known as NON-PARAMETRIC (or distribution-free) tests. They often involve ranking the data, with individuals or items put in order of magnitude and assigned a number which denotes their position in this ordering.

Ranking Data

Ranking the data involves putting the values in numerical order and then assigning new values to denote where in the ordered set they fall. We give the smallest value the number 1, the next smallest value the number 2, the next the number 3, and so on.

The numbers 1, 2, 3, ... that are assigned to the various values are called the ranks. If there are n values in the sample, the largest value will have rank 'n'.

Sometimes there are ties in the data. This means that two or more values are the same, so that there is no strictly increasing order. When this happens, we average the ranks for the tied values.

For example:

To rank the following sample of 14 values:

2 34 -5 -7 25 2 34 34 67 28 -2 0 7 23

Sorting the values into the order of magnitude gives:

-7 -5 -2 0 2 2 7 23 25 28 34 34 34 67

Ranks are assigned:


There are 14 numbers, so the largest number has rank 14.

The ranks 5 and 6 need to be assigned to the two '2's; hence assign rank (5+6)/2 = 5½ to each value 2.

The ranks 11, 12, and 13, need to be assigned to the three '34's, hence assign rank (11+12+13)/3 = 12 to each value 34.

The ranks for the sample are:

Rank Table 2
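The ranking rule described above (order the values, number them from 1, and average the ranks of tied values) can be sketched in a few lines of Python. This is only an illustrative sketch: the function name `average_ranks` is an assumption of this example, not part of the original notes.

```python
def average_ranks(values):
    """Return ranks (1 = smallest); tied values share the mean of their ranks."""
    # Indices of the values, sorted by the value they point to
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        # Extend the block while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Positions start..end (0-based) correspond to ranks start+1..end+1
        mean_rank = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = mean_rank
        start = end + 1
    return ranks


# The sample of 14 values from the text, in its original order
sample = [2, 34, -5, -7, 25, 2, 34, 34, 67, 28, -2, 0, 7, 23]
ranks = average_ranks(sample)

# Print each value next to its rank, smallest value first
for value, rank in sorted(zip(sample, ranks)):
    print(value, rank)
```

Printed in sorted order, the two values of 2 each receive rank 5.5 and the three values of 34 each receive rank 12, as worked out above.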

Non-parametric tests use the ranks rather than the original data values in the subsequent analysis.

The median, or middle ranked value, is used as a measure of centre. Non-parametric tests make comparisons of medians between groups as opposed to parametric tests which compare means. The ranks yield a lot less information than the original values and are not very sensitive to changes in the data. For example, if the highest number in the example sample of 14 above had been 10,067 instead of 67, it would still receive rank 14.

Whilst no distributional assumptions need to be made to use non-parametric tests, they require larger samples to make the same inferences about the populations being considered and should only be used when unavoidable.

This chapter gives details of some of the simpler non-parametric tests. This detail is given to facilitate understanding of this subset of significance tests and how they compare with the parametric tests described in the previous chapters. Whereas the distributional assumptions are different, all significance tests (whether parametric or non-parametric) address some pre-specified null hypothesis and result in a p-value (the probability of obtaining data at least as extreme as those observed if the null hypothesis were true). Whether the test is parametric or not, it is always preferable that some measure of precision (usually expressed via a confidence interval) is given with the observed findings.

Confidence Intervals for a Median

Confidence intervals were previously given with sample mean values. These intervals gave a measure of how precisely the sample estimate approximated the population value. The width of the interval depended on:

  • Variability of the data
  • Size of the sample

It is possible to similarly obtain a confidence interval for a sample median, the interpretation of which is directly comparable to that for the mean. However, because of the lack of distributional assumptions, it may not be possible to obtain an exact 95% confidence interval for the median.

Standard errors cannot be calculated for distribution free statistics. However, confidence intervals can be calculated and have the same interpretation; i.e., they consist of the range of population values with which the sample is compatible.

The confidence limits are not necessarily symmetric around the sample estimate (unlike confidence intervals constructed from standard errors, which are symmetric).

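The confidence limits are given by actual values in the sample. For a sample of $n$ values, the ranks of the values that form the approximate 95% confidence limits are:

$$\frac{n}{2} - \frac{1.96\sqrt{n}}{2} \quad \text{and} \quad 1 + \frac{n}{2} + \frac{1.96\sqrt{n}}{2}$$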

For example:

If there are 20 values in the sample:

The median is between the 10th and 11th ordered measurements

The 95% confidence interval for the median is given by the values ranked:

10 - 4.38 = 5.62 and 1 + 10 + 4.38 = 15.38 (where 4.38 = 1.96 × √20 / 2)

Of course, there are no 5.62 and 15.38 ranked values, so we choose the nearest ranks to these and have an APPROXIMATE 95% confidence interval for the median. For the 20 values this will be the 6th to the 15th ranked values.
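The rank calculation in the example above (10 - 4.38 and 1 + 10 + 4.38 for 20 values) is consistent with the usual approximation n/2 ∓ 1.96√n/2, with 1 added to the upper rank. A minimal Python sketch of that calculation follows; the function name `median_ci_ranks` and the choice to round to the nearest whole rank are assumptions made for illustration.

```python
import math


def median_ci_ranks(n, z=1.96):
    """Approximate ranks of the lower and upper 95% confidence limits for a median.

    Returns the unrounded ranks and the nearest whole ranks.
    """
    half_width = z * math.sqrt(n) / 2
    lower = n / 2 - half_width
    upper = 1 + n / 2 + half_width
    return lower, upper, round(lower), round(upper)


lower20, upper20, lower_rank20, upper_rank20 = median_ci_ranks(20)
print(round(lower20, 2), round(upper20, 2))  # unrounded ranks for n = 20
print(lower_rank20, upper_rank20)            # nearest whole ranks for n = 20
```

For 20 values this gives the 6th and 15th ranked values, and for 12 values the 3rd and 10th, so the interval is read straight from the ordered sample.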


Consider the differences in temperature between the start and end of surgery for the 12 patients undergoing percutaneous surgery.


The median difference is halfway between the 6th and 7th ranked values, i.e., halfway between -0.9 and -0.7, which is -0.8.

There are 12 values and hence:

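The ranks of the confidence limits are:

$\frac{12}{2} - \frac{1.96\sqrt{12}}{2} = 2.61$ and $1 + \frac{12}{2} + \frac{1.96\sqrt{12}}{2} = 10.39$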

So an approximate 95% confidence interval is given by the 3rd and 10th ranked values (-3.2, -0.2), compared to a 95% confidence interval for the mean which would be (-2.49, -0.57).

Sign Test

We may wish to test whether the sample median differs significantly from some pre-specified hypothesised value. The simplest way to do this is with a sign test.

If the hypothesised median value were true, we would expect approximately half of the sample values to be larger than the hypothesised value and the remaining half to be less than it.

If the hypothesised median is not true, then the numbers of values above and below the hypothesised value may be quite different in our sample.

The sign test considers how likely we were to obtain the observed imbalance if the hypothesised median were true. The following table allows us to obtain p-values.


There were 12 temperature changes; if temperature change was on average zero then we would expect 6 values above zero and 6 below. In this sample, only 1 value is positive and 11 are negative. The table shows that for this split of values above and below the hypothesised value, p=0.0063.
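The tabulated p-value can also be obtained directly from the binomial distribution with probability 0.5, since under the null hypothesis each value is equally likely to fall above or below the hypothesised median. The sketch below assumes a two-sided test and assumes that values exactly equal to the hypothesised median have already been excluded; the function name `sign_test_p` is hypothetical.

```python
from math import comb


def sign_test_p(n_above, n_below):
    """Two-sided sign test p-value from the counts above and below the hypothesised median."""
    n = n_above + n_below
    k = min(n_above, n_below)
    # Probability of a split at least as unbalanced as k versus n - k, in one direction
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    # Double for the two-sided test, capping at 1
    return min(1.0, 2 * tail)


p_temperature = sign_test_p(1, 11)
print(round(p_temperature, 4))
```

For the split of 1 positive and 11 negative changes this reproduces the value of 0.0063 quoted from the table.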

Sign Test Table

Wilcoxon Signed Rank Test

The sign test uses very little of the information in the dataset. No account is taken of the distance from the hypothesised median of the values either side i.e., we would have obtained the same p-value whether the 11 temperature differences below zero were all -0.00001 or whether they were all -10,000.

The Wilcoxon signed rank test considers the ranks of the distances of the measurements from the hypothesised value (in this case zero) and then sums these ranks for the two sections of the dataset (above and below zero). If the median is zero then we would expect the two sums of ranks to be approximately equal.

For example:

  • With a sample of 12, the total of the ranks will be 1+2+3+4+...+12 = 78
  • If the null hypothesis is true, we would expect the ranks for those above 0 to be about 39 (= half of 78) and the ranks of those below 0 to also be about 39.

The lesser of the two rank sums can be referred to the table below, which gives critical values for the Wilcoxon signed rank test for sample sizes up to 25.



Absolute differences from zero (hypothesised median), ranks and signs of differences (in brackets) for the temperature change data are:


Temperature Change Sign Ranks

There is only one positive difference and 11 negative differences. The sum of the ranks for the negative differences (12 + 10.5 + 10.5 + 9 + 8 + 7 + 6 + 5 + 4 + 3 + 1.5 = 76.5) is much larger than the sum of the ranks for the positive differences, which is 1.5.

It is this lesser value (1.5) which is referred to the table of critical values for the Wilcoxon signed rank test:

For sample size 12, a value of 7 would give p=0.01; since 1.5 is less than this, p<0.01.

This tells us that if the true median temperature change were actually zero then we would be very unlikely to have obtained a sample of 12 patients with the sum of the ranks of the values one side of zero as small as 1.5 (a sum of 7 would only be expected one time in 100, p=0.01; so a value as small as 1.5 is even less likely).

The Wilcoxon signed rank test is more sensitive than the sign test since it uses more information from the data sample.
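The rank sums used by the Wilcoxon signed rank test can be sketched as follows. The function name `signed_rank_sums` and the six `example_changes` values are hypothetical and purely illustrative (they are not the temperature data); only the list of ranks at the end is taken from the text.

```python
def signed_rank_sums(values, hypothesised=0.0):
    """Return (sum of ranks above, sum of ranks below) the hypothesised median.

    Distances from the hypothesised value are ranked ignoring sign,
    tied distances share the mean of their ranks, and values equal to
    the hypothesised median are dropped.
    """
    diffs = [v - hypothesised for v in values if v != hypothesised]
    distances = sorted(abs(d) for d in diffs)
    rank_of = {}
    for dist in set(distances):
        first = distances.index(dist) + 1   # lowest rank occupied by this distance
        count = distances.count(dist)       # number of tied values
        rank_of[dist] = first + (count - 1) / 2
    positive = sum(rank_of[abs(d)] for d in diffs if d > 0)
    negative = sum(rank_of[abs(d)] for d in diffs if d < 0)
    return positive, negative


# Hypothetical changes, for illustration only
example_changes = [-3.0, -1.5, -1.5, -0.5, 0.5, -2.0]
pos_sum, neg_sum = signed_rank_sums(example_changes)
print(pos_sum, neg_sum)

# Ranks of the 11 negative temperature differences as listed in the text
negative_ranks = [12, 10.5, 10.5, 9, 8, 7, 6, 5, 4, 3, 1.5]
print(sum(negative_ranks), sum(negative_ranks) + 1.5)
```

The two sums always add up to n(n+1)/2, which is 78 for the 12 temperature changes, so the smaller sum alone is enough to consult the table.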

Comparing Two Groups

In previous sections, the temperature changes of a group of patients who had undergone surgery were investigated to determine whether the average temperature change was zero. However, whatever the average change in this group of patients, we would be unable to attribute this change to the surgery unless there was a control group with which to compare any recorded difference.

Suppose there was another group consisting of 10 patients, whose temperature changes were recorded but who were not operated on and that these changes were (in order):


The mean for this second group of 10 patients is -0.18

Assuming approximate normality, the standard error for the difference between the groups can be calculated to be 0.952. The difference in mean changes in temperature between the group of patients undergoing surgery and those not is 1.35 degrees (1.53 - 0.18), which is 1.35/0.952 = 1.42 standard errors from the null hypothesised difference of zero.

Hence, an unpaired 2-sample t-test would have a p-value of 0.171 and a 95% confidence interval for the mean difference of (-0.63, 3.34).

The difference between the two groups can be tested with the nonparametric equivalent of the 2-sample t-test, the Mann-Whitney U test.

Mann-Whitney U Test

This test is the non-parametric equivalent of the two sample t-test. The test is sometimes known as the Wilcoxon 2-sample test, the Mann-Whitney test or just the U-test.

For larger samples, a formula can be used:
  1. Arrange all the observations into a single ranked series. That is, rank all the observations without regard to which sample they are in.
  2. Add up the ranks for the observations which came from sample 1. The sum of ranks in sample 2 follows by calculation, since the sum of all the ranks equals N(N+1)⁄2 where N is the total number of observations.
  3. U is then given by:
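$$U_1 = R_1 - \frac{n_1(n_1+1)}{2}$$

where $R_1$ is the sum of the ranks in sample 1 and $n_1$ is the number of observations in sample 1.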

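Note that there is no specification as to which sample is considered sample 1. An equally valid formula for U is:

$$U_2 = R_2 - \frac{n_2(n_2+1)}{2}$$

where $R_2$ is the sum of the ranks in sample 2 and $n_2$ is the number of observations in sample 2.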

The smaller value of U1 and U2 is the one used when consulting significance tables.

The null hypothesis being tested is that the probability that a member of the 1st population drawn at random will exceed a member of the 2nd population drawn at random is 0.5.

Note that the Mann-Whitney U test is not a test of the difference in medians unless some assumptions are made about the distributions.


To compare the temperature changes for the 12 operated patients (group 1) and the 10 not operated (group 2), the 22 temperature changes should first be ranked from 1 to 22:


Sum of ranks for group 1 (operated) = 1+4.5+4.5+6+7+8+9.5+11.5+13.5+15+16.5+18 = 115

Sum of the ranks for group 2 (not operated) =2+3+9.5+11.5+13.5+16.5+19+20+21+22 = 138

Therefore U1 = 115 - 0.5(12)(13) = 115 - 78 = 37

U2 = 138 - 0.5(10)(11) = 138 - 55 = 83

For sample sizes of 10 and 12 the cut-off for a p-value of 0.05 is 29 and the cut-off for a p-value of 0.01 is 21. The lower value from our data is 37, which is larger than both of these cut-off values, and hence the distributions are not significantly different (i.e., p>0.05).
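The arithmetic above can be checked with a short Python sketch that uses the ranks listed in the worked example. The function name `mann_whitney_u` is an assumption of this illustration; it simply subtracts n(n+1)/2 from a group's rank sum, as in the calculation of U1 and U2 above.

```python
def mann_whitney_u(rank_sum, n):
    """U statistic for one group, given its rank sum and its sample size."""
    return rank_sum - n * (n + 1) / 2


# Ranks from the worked example in the text
group1_ranks = [1, 4.5, 4.5, 6, 7, 8, 9.5, 11.5, 13.5, 15, 16.5, 18]   # operated
group2_ranks = [2, 3, 9.5, 11.5, 13.5, 16.5, 19, 20, 21, 22]           # not operated

r1, r2 = sum(group1_ranks), sum(group2_ranks)
u1 = mann_whitney_u(r1, len(group1_ranks))
u2 = mann_whitney_u(r2, len(group2_ranks))

print(r1, r2)        # rank sums for the two groups
print(u1, u2)        # the two U statistics
print(min(u1, u2))   # the value referred to the significance table
```

Two useful checks on hand calculations are that the rank sums add up to N(N+1)/2 (here 253 for N = 22) and that U1 + U2 equals the product of the two sample sizes (here 12 × 10 = 120).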

Mann Whitney Table

Other Nonparametric Tests

There are non-parametric 'equivalents' to most of the parametric tests that we use.

The question of whether to use parametric or non-parametric methods is made analogous to choice of medical treatment by Martin Bland in his book 'An Introduction to Medical Statistics' (page 238, section 12.7).

Here are some extracts:

"For many statistical problems there are several possible solutions, just as for many diseases there are several treatments, similar perhaps in their overall efficacy but displaying variation in their side-effects, in their interactions with other diseases or treatments, and in their suitability for different types of patients. There is often no one right treatment, but rather treatment is decided by the prescriber's judgement of these effects, past experience, and plain prejudice. Many problems in statistical analysis are like this....Our choice of method depends on the plausibility of Normal assumptions, the importance of obtaining a confidence interval, the ease of calculation, and so on.... Some users of statistical methods are very concerned about the implications of Normal assumptions and will advocate non-parametric methods wherever possible; others are too careless of the errors that may be introduced when assumptions are not met.

I sometimes meet people who tell me that they have used non-parametric methods throughout their analysis as if this is some kind of badge of statistical purity. It is nothing of the kind. It may mean that both their significance tests have less power than they might have, and that results are left as 'not significant' when, for example, a confidence interval for a difference might be more informative....Rather we should choose the method most suited to the problem, bearing in mind both the assumptions we are making and what we really want to know."

'Ease of calculation' has become less of a consideration as appropriate software has become more readily available. The ease with which calculations can be performed also means that it is relatively easy to perform both parametric and non-parametric tests when there is doubt as to whether the parametric assumptions are met or not. If the parametric and non-parametric tests lead to conflicting conclusions then this suggests that the parametric assumptions may not have been met, and the results from the non-parametric test should be preferred.

Non-parametric tests are less susceptible to changes in the data and outliers, not requiring distributional assumptions to be made. If, however, some distributional form can be assumed, then including this information in the analyses will strengthen the inferences that can be made.

Fisher's Exact Test

If the assumptions for using the chi-square test are not met (i.e., small expected numbers in one or more cells), then an alternative hypothesis test to use is Fisher's exact test.

This non-parametric test assumes that the row and column totals are fixed and considers the possible distributions of data values within the table (given the fixed totals) that would be more extreme than those observed in the current samples.

If there are only a few distributions that would be more extreme then this suggests that the current set-up is unusual if the null hypothesis (equal distribution between groups) were true. This would lead to a low p-value (the observed data would be unlikely if the null hypothesis were true).

If there are many more extreme distributions then this suggests that the current set-up is not unusual if the null hypothesis were true, leading to a p-value close to 1.

For example:

The fixed row and column totals for the following table are the numbers 19, 58, 39 and 38:


Out of the 19 breastfed, 12 are within the acute group and only 7 are within the persistent group. More extreme tables would have:

  • 13 in the acute group, 6 in the persistent
  • 14 in the acute group, 5 in the persistent
  • 15 in the acute group, 4 in the persistent
  • 16 in the acute group, 3 in the persistent
  • 17 in the acute group, 2 in the persistent
  • 18 in the acute group, 1 in the persistent

Or as the most extreme:

  • 19 in the acute group and none in the persistent

So there are 7 ways in which the data can be distributed which are more extreme than in the samples obtained here.

Each possible table has a probability determined by the number of ways in which it could arise, out of all the ways in which 19 breast feeders could be distributed between one group of 39 patients and another group of 38. Summing these probabilities for the observed table and the more extreme tables gives the Fisher's exact test p-value.
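With the row and column totals fixed, the number of breastfed patients falling in one group follows a hypergeometric distribution, so the calculation can be sketched with binomial coefficients. The text does not say which group has 39 patients, so the sketch below assumes, purely for illustration, that the acute group is the group of 39. The function name `table_probability` is hypothetical, and the result is a one-sided p-value that sums the observed table (12 breastfed in the acute group) and the 7 more extreme tables.

```python
from math import comb


def table_probability(k, total, breastfed, group_size):
    """Probability that exactly k of the breastfed patients fall in the chosen
    group, given fixed row and column totals (hypergeometric probability)."""
    return (comb(breastfed, k) * comb(total - breastfed, group_size - k)
            / comb(total, group_size))


total = 77        # 19 breastfed + 58 not breastfed
breastfed = 19
acute = 39        # ASSUMPTION: the acute group is the group of 39 patients

# Observed table (12) plus the more extreme tables (13, 14, ..., 19)
p_one_sided = sum(table_probability(k, total, breastfed, acute)
                  for k in range(12, breastfed + 1))
print(p_one_sided)
```

A two-sided p-value would also include tables that are equally unlikely in the opposite direction; statistical packages differ in exactly how they define that second tail.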

Spearman's Correlation

The Spearman correlation coefficient is the non-parametric equivalent of the Pearson correlation coefficient. It similarly takes values between -1 and +1, but the difference is that it quantifies the extent to which the variables tend to increase or decrease together, i.e., the extent to which one variable tends to increase (or decrease) as the other increases, whether or not the relationship is linear. A value of zero indicates no such tendency.

The Pearson correlation only assesses linear association. When the Pearson correlation is exactly +1 or -1, the Spearman correlation takes the same value, but a Spearman correlation of +1 or -1 can occur when the Pearson correlation is nearer to zero.

For example:


The Spearman correlation coefficient is equivalent to the Pearson correlation of the ranks.
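This equivalence can be illustrated with a small Python sketch that computes the Pearson correlation of the ranks. The data are hypothetical (x = 1 to 5 and y = x cubed), chosen because y always increases with x but not in a straight line; the function names are assumptions of this example.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    mean_x, mean_y = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def average_ranks(values):
    """Ranks (1 = smallest), with tied values sharing the mean of their ranks."""
    ordered = sorted(values)
    return [ordered.index(v) + 1 + (ordered.count(v) - 1) / 2 for v in values]


def spearman(x, y):
    """Spearman correlation, computed as the Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical data: y rises with x, but the relationship is curved
x = [1, 2, 3, 4, 5]
y = [v ** 3 for v in x]

r_pearson = pearson(x, y)
r_spearman = spearman(x, y)
print(r_pearson, r_spearman)
```

Because y always increases with x, the ranks agree perfectly and the Spearman correlation is 1, whereas the Pearson correlation falls below 1 because the relationship is not linear.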

Recommended Books

This chapter has introduced the simplest of the non-parametric tests. Detail of these tests has been given to illustrate the general approach of the nonparametrics, their dependence on the data ranks and the differences between these and their parametric equivalents.

The chapter has provided little more than an introduction to this vast branch of statistical analyses.

Recommended books for further information about non-parametric tests are:

  • Practical Nonparametric Statistics by W. J. Conover
  • Non-parametric Statistics for the Behavioural Sciences by S. Siegel
  • Applied Nonparametric Statistical Methods by P. Sprent