UCL Great Ormond Street Institute of Child Health



Chapter 9: Bootstrapping

The basic tenet of statistical analyses is to make inferences about populations from samples. In chapter 5 there was illustration and discussion of how inference is based on the random sample obtained, which would definitely differ if a different sample had been randomly selected. Although we would only have a single sample for any research project, this chapter encouraged thought about the sampling process. On average we get surprisingly accurate information with a relatively small random sample, but we must be aware that on occasion our sample may be less representative of the population. We quantify the chance of this happening and produce a confidence interval that we know will contain the true value on a given (usually 95) percentage of occasions.

We calculated standard errors according to certain distributional assumptions and these could be used to construct confidence intervals. Sometimes these assumptions could not be made or verified, and chapter 8 introduced non-parametric alternatives. A technique for calculating a confidence interval for the median of a single sample was given, but none were presented for more complex scenarios, such as a difference in medians, as the calculations for these are not simple.

An alternative technique, known as bootstrapping, exists for calculating a confidence interval for any sample estimate of a population parameter. Bootstrapping makes no parametric distributional assumptions and is valid even for highly complex scenarios. It is computer intensive, and historically this may have limited its use. Today, however, this is not a restriction at all and many packages now have an option to calculate bootstrap confidence intervals.

In this section, we will explain and illustrate how a bootstrap confidence interval is obtained. 

The basics

A random sample of the population is obtained. This may be a set of birthweights from the local area, changes in temperature for a sample of individuals undergoing an operation, the percentage of a random sample of nonsyndromic births who have some form of clefting, or a sample of patients with hypothyroidism used to estimate the correlation between FMD and TSH. The sample may comprise two groups with different features, for example those who receive a treatment and those given placebo, or children with a disorder and healthy children.

In each case, the sample(s) provided are the only information available (for this research study) about the population of interest. Rather than making any distributional assumptions (for example that the population is normally distributed and the sample gives a valid estimate of the population standard deviation), we use repeated sampling of the data available to better understand the potential population values. We know that the values in the sample are feasible (since we have sampled them so they must exist) and the distribution of these in the sample will give us an idea of the population distribution of values. 

To bootstrap, we must generate new samples of the same size as the original. Sampling is done with replacement; otherwise we would get the same set of data repeatedly. For each new sample generated, the population parameter we are interested in is estimated. This gives a collection of population estimates (one per generated new sample) and the distribution of these informs us of possible population values.

For example: Suppose our study gives us a sample of 10 data values taken from some population and these are:

5.4, 2.3, 4.7, 32, 0.3, -0.1, 22.2, -7.6, 1.2, 1.4

When we randomly sample from these 10, each value has a 10% (1 in 10) chance of being selected. Because we are sampling with replacement, each value always has a 10% chance of being the next selected value. 

Some examples of samples of size 10 with replacement from the above dataset are:

i) 2.3, -7.6, 4.7, 2.3, 32, -0.1, 5.4, 22.2, -0.1, 1.2

ii) 4.7, 32, 22.2, 5.4, 5.4, 4.7, 2.3, 22.2, 2.3, 5.4

iii) 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2, 1.2

iv) 22.2, 0.3, 32, -0.1, -7.6, 1.4, 1.4, 1.2, 4.7, -0.1

v) 2.3, 4.7, 1.4, 32, 0.3, -0.1, 5.4, 22.2, -7.6, 1.2

Each of these samples is known as a bootstrap sample. For each sample we can estimate the population parameter we are interested in (mean, median, difference between two subsamples etc.) and this gives a distribution of bootstrap sample estimates of the population parameter. 
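The resampling step described above can be sketched in a few lines of Python. This is only an illustration using the standard library: the function name `bootstrap_sample` and the seed value are arbitrary choices, not part of the chapter.

```python
import random

# The ten observed values from the example above.
data = [5.4, 2.3, 4.7, 32, 0.3, -0.1, 22.2, -7.6, 1.2, 1.4]

def bootstrap_sample(values, rng):
    """Draw one bootstrap sample: same size as `values`, sampled with replacement."""
    return rng.choices(values, k=len(values))

rng = random.Random(1)  # fixed seed (arbitrary) so that a run can be repeated
for _ in range(5):
    # Each printed line is one bootstrap sample; repeated values are expected.
    print(bootstrap_sample(data, rng))
```

Because every draw is made from the full list, a value can appear several times in one bootstrap sample and not at all in another, as in samples (i) to (v).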

For example, if we were trying to estimate the mean value of the population, then each of the bootstrap samples would give an estimate of this. There would be 5 different estimates from the 5 samples above. The sample means will clearly be different. The value for sample (iii) will be 1.2, the estimate from sample (ii) will be larger than for the other samples since mostly large values were sampled.
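As a quick check, the five sample means can be computed directly. The sketch below simply types in samples (i) to (v) as listed; the dictionary name `samples` is an arbitrary choice.

```python
from statistics import mean

# Bootstrap samples (i) to (v) exactly as listed above.
samples = {
    "i":   [2.3, -7.6, 4.7, 2.3, 32, -0.1, 5.4, 22.2, -0.1, 1.2],
    "ii":  [4.7, 32, 22.2, 5.4, 5.4, 4.7, 2.3, 22.2, 2.3, 5.4],
    "iii": [1.2] * 10,
    "iv":  [22.2, 0.3, 32, -0.1, -7.6, 1.4, 1.4, 1.2, 4.7, -0.1],
    "v":   [2.3, 4.7, 1.4, 32, 0.3, -0.1, 5.4, 22.2, -7.6, 1.2],
}

# One estimate of the population mean per bootstrap sample.
sample_means = {label: round(mean(values), 2) for label, values in samples.items()}
for label, m in sample_means.items():
    print(f"{label}: {m:.2f}")
# i: 6.23
# ii: 10.66
# iii: 1.20
# iv: 5.54
# v: 6.18
```

Sample (v) happens to be a reordering of the original ten values, so its mean (6.18) equals the mean of the original sample.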

When constructing a bootstrap confidence interval, many samples will be generated, far more than the 5 above. Since computation is not generally a problem, tens of thousands of samples may be generated. This yields a comprehensive distribution of potential sample estimates. The centiles of that distribution are used to give an overall population estimate (the median, or 50th centile) and also a confidence interval. Remember that this is a distribution of sample estimates (similar to the distribution at the foot of page 108). We take the 2.5th and 97.5th centiles of the distribution of bootstrap sample estimates to provide our estimated 95% confidence interval for the population parameter, since the range between them contains 95% of the sample estimates.
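The whole procedure (resample, estimate, take centiles) might be sketched as follows for the ten values above. The function name `percentile_bootstrap` and its parameters `n_boot` and `seed` are illustrative assumptions. Statistical packages differ slightly in how they interpolate centiles, so results from other software may not match to the last decimal place.

```python
import random
from statistics import mean, median, quantiles

def percentile_bootstrap(values, statistic, n_boot=10_000, seed=0):
    """Return (estimate, lower, upper) from the bootstrap distribution.

    estimate = median (50th centile) of the bootstrap estimates,
    lower/upper = 2.5th and 97.5th centiles (a 95% interval).
    """
    rng = random.Random(seed)
    n = len(values)
    # One estimate per bootstrap sample (size n, drawn with replacement).
    estimates = [statistic(rng.choices(values, k=n)) for _ in range(n_boot)]
    # n=40 gives cut points at 2.5%, 5%, ..., 97.5%; keep the first and last.
    cuts = quantiles(estimates, n=40)
    return median(estimates), cuts[0], cuts[-1]

data = [5.4, 2.3, 4.7, 32, 0.3, -0.1, 22.2, -7.6, 1.2, 1.4]
estimate, lower, upper = percentile_bootstrap(data, mean)
print(f"bootstrap estimate {estimate:.2f}, 95% ci ({lower:.2f}, {upper:.2f})")
```

Passing a different `statistic` (for example `median`) gives the corresponding interval without any other change to the code.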

Example 1: birthweight data

At the beginning of Chapter 5, we were given a random sample of 30 birthweights from a 'population' of 17,333. Within that population, the mean birthweight was 3263.57g and it was this quantity that the sample was aiming to estimate. The sample of 30 had mean 3379.5g and the standard error was estimated as 108.85, giving a 95% confidence interval of (3379.5 ± 1.96 (108.85)) = (3379.5 ± 213.3) = (3166.2, 3592.8) grams.

The 30 birthweights as given earlier in Chapter 5 are reproduced here:

Data Example 1

10,000 samples of size 30 were taken with replacement from these 30 values. For each of the 10,000 samples the mean was calculated and the barchart of these is given below:


The median, 2.5th and 97.5th centiles of this distribution of 10,000 are 3377.3, 3178.2 and 3591.9 respectively. Hence the bootstrap estimate of the population mean is 3377.3 with 95% ci (3178.2, 3591.9 grams). 

So, without making any distributional assumptions the bootstrap approach has very nearly replicated the parametric method. 

Example 2: estimating a proportion

The same methods can be used to estimate a population proportion or percentage. 

Suppose that 15% (or 0.15) of the population have some feature that they will test positive for. The aim of a study is to estimate the positivity rate and a random sample of 57 individuals is tested. 

Within the sample obtained, there are 7 who test positive, giving a sample estimate of the population proportion = 7/57 = 0.123. A 95% confidence interval for the population proportion can be calculated to be (0.038, 0.208).
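The interval quoted above is consistent with the usual normal approximation for a proportion, with standard error sqrt(p(1 - p)/n). The sketch below assumes that this is the calculation used; the variable names are illustrative.

```python
from math import sqrt

positives, n = 7, 57
p_hat = positives / n                      # sample proportion, 7/57
se = sqrt(p_hat * (1 - p_hat) / n)         # normal-approximation standard error
lower = p_hat - 1.96 * se
upper = p_hat + 1.96 * se
print(f"{p_hat:.3f} ({lower:.3f}, {upper:.3f})")
# 0.123 (0.038, 0.208)
```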

To obtain bootstrap estimates, samples of size 57 are randomly taken from the 57 observed i.e. random samples from a set of values where 7 are positive and 50 negative. For each sample, the proportion positive is recorded.

Results of the proportions obtained from 20,000 bootstrap estimates are shown in the barchart below:

Histogram Example 2

The distribution is clearly skew, but we can still use the centiles of the distribution to estimate the population proportion and to calculate a 95% confidence interval for this.

The 50th, 2.5th and 97.5th centiles of the 20,000 bootstrap sample proportion estimates are 0.123, 0.053 and 0.211 respectively. This gives a bootstrap estimate of 0.123 with 95% confidence interval (0.053, 0.211), which again is very close to the estimates based on parametric assumptions.
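The bootstrap for this example can be sketched as below, coding a positive test as 1 and a negative test as 0. The seed is arbitrary, so a run will not reproduce the exact centiles quoted above, although it should land close to them.

```python
import random
from statistics import median, quantiles

# The observed sample: 7 positives (1) and 50 negatives (0).
observed = [1] * 7 + [0] * 50
n = len(observed)

rng = random.Random(2024)   # arbitrary seed, for repeatability only
n_boot = 20_000

# Proportion positive in each bootstrap sample of size 57.
proportions = [sum(rng.choices(observed, k=n)) / n for _ in range(n_boot)]

cuts = quantiles(proportions, n=40)        # cut points at 2.5%, 5%, ..., 97.5%
estimate, lower, upper = median(proportions), cuts[0], cuts[-1]
print(f"{estimate:.3f} ({lower:.3f}, {upper:.3f})")
```

Each bootstrap proportion must be a multiple of 1/57, which is why the distribution is discrete as well as skew.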

Example 3: sampling from a skewed distribution

Suppose there is a serum level measure, made on a continuous scale from zero upwards, that has a markedly upward-skewed distribution (lognormal, with mean 1 and sd 1.6 on the log scale):

Probability Density Example 3

The true median of this distribution is 2.718 (= exp(1)) and it is this value that a sample taken from this population aims to estimate.

Taking a random sample of 63 serum measures, we obtain the values:

1.496, 4.995, 1.364, 16.021, 586.671, 8.278, 9.371, 8.460, 5.456, 80.644, 21.280, 1.770, 1.765, 0.067, 1.332, 3.419, 7.239, 0.872, 2.135, 15.480, 2.859, 10.882, 0.650, 0.419, 0.315, 1.313, 5.121, 8.307, 1.456, 2.758, 21.181, 2.738, 5.585, 19.316, 2.036, 0.539, 5.430, 1.442, 1.667, 0.113, 1.053, 0.089, 49.127, 1.480, 6.429, 15.602, 3.792, 0.951, 0.114, 3.375, 0.550, 2.240, 0.064, 1.532, 0.332, 7.258, 0.078, 0.767, 25.256, 0.651, 60.190, 1.682, 0.787

The values are (as expected) clearly skew, with a major outlier of 586.671:

Histogram Example 3

The median of this sample is 2.135 with 95% ci (1.480, 4.995).

If we assume lognormality (which in this case is the truth), then logging the values leads to a much more symmetric distribution:


The mean (95% ci) for this sample, based on a normal approximation can be calculated as 0.92 (0.475, 1.366).

Exponentiating these values gives the estimated median and 95% confidence interval of 2.511 (1.608, 3.919).
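The log-scale calculation can be sketched as follows. It assumes a multiplier of 1.96 for the normal approximation, which appears to be consistent with the limits quoted above; a t-based multiplier would give a very slightly wider interval.

```python
import math
from statistics import mean, stdev

# The 63 serum measures listed above.
serum = [
    1.496, 4.995, 1.364, 16.021, 586.671, 8.278, 9.371, 8.460, 5.456, 80.644,
    21.280, 1.770, 1.765, 0.067, 1.332, 3.419, 7.239, 0.872, 2.135, 15.480,
    2.859, 10.882, 0.650, 0.419, 0.315, 1.313, 5.121, 8.307, 1.456, 2.758,
    21.181, 2.738, 5.585, 19.316, 2.036, 0.539, 5.430, 1.442, 1.667, 0.113,
    1.053, 0.089, 49.127, 1.480, 6.429, 15.602, 3.792, 0.951, 0.114, 3.375,
    0.550, 2.240, 0.064, 1.532, 0.332, 7.258, 0.078, 0.767, 25.256, 0.651,
    60.190, 1.682, 0.787,
]

logged = [math.log(v) for v in serum]              # natural logs
mean_log = mean(logged)
se_log = stdev(logged) / math.sqrt(len(logged))    # standard error of the mean
lower_log = mean_log - 1.96 * se_log               # assumed multiplier: 1.96
upper_log = mean_log + 1.96 * se_log

# Back-transform: exp(mean of logs) estimates the median of a lognormal.
print(f"log scale: {mean_log:.3f} ({lower_log:.3f}, {upper_log:.3f})")
print(f"raw scale: {math.exp(mean_log):.3f} "
      f"({math.exp(lower_log):.3f}, {math.exp(upper_log):.3f})")
```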

Bootstrap samples of size 63 would be taken with replacement from the raw (unlogged) values as listed above. Taking 25,000 such samples and the median of each, gives the following distribution:


Again, this is skew but we can use the sample centiles to estimate the population median. The bootstrap estimate of the population median (95% confidence interval) based on 25,000 bootstrap samples is 2.135 (1.480, 3.792). 

The bootstrap estimates again replicate fairly closely the parametric estimates. 

Although in this example the best approximation to the truth was given by assuming log-normality, this is something that we may not be aware of. If it were known that the serum values are lognormally distributed, then we could log the values and bootstrap the logged values. Taking 25,000 bootstrap samples from the logged values, and the mean of each, gives a distribution:

Histogram 4 Example 3

And an estimated mean (95% ci) on the logged scale of 0.918 (0.481, 1.36). Exponentiating, this gives a median and 95% ci of 2.505 (1.618, 3.898).
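Both bootstrap analyses in this example (medians of the raw values, and means of the logged values followed by back-transformation) might be sketched as below. The helper name `percentile_bootstrap` and the seed are illustrative assumptions, so the output should be close to, but not identical to, the figures quoted.

```python
import math
import random
from statistics import median, quantiles

# The 63 serum measures listed above.
serum = [
    1.496, 4.995, 1.364, 16.021, 586.671, 8.278, 9.371, 8.460, 5.456, 80.644,
    21.280, 1.770, 1.765, 0.067, 1.332, 3.419, 7.239, 0.872, 2.135, 15.480,
    2.859, 10.882, 0.650, 0.419, 0.315, 1.313, 5.121, 8.307, 1.456, 2.758,
    21.181, 2.738, 5.585, 19.316, 2.036, 0.539, 5.430, 1.442, 1.667, 0.113,
    1.053, 0.089, 49.127, 1.480, 6.429, 15.602, 3.792, 0.951, 0.114, 3.375,
    0.550, 2.240, 0.064, 1.532, 0.332, 7.258, 0.078, 0.767, 25.256, 0.651,
    60.190, 1.682, 0.787,
]

def average(values):
    """Arithmetic mean (math.fsum keeps the summation accurate and fast)."""
    return math.fsum(values) / len(values)

def percentile_bootstrap(values, statistic, n_boot, seed):
    """Return (50th, 2.5th, 97.5th) centiles of the bootstrap estimates."""
    rng = random.Random(seed)
    n = len(values)
    estimates = [statistic(rng.choices(values, k=n)) for _ in range(n_boot)]
    cuts = quantiles(estimates, n=40)      # cut points at 2.5%, ..., 97.5%
    return median(estimates), cuts[0], cuts[-1]

# (a) Bootstrap the median of the raw (unlogged) values.
raw = percentile_bootstrap(serum, median, n_boot=25_000, seed=1)
print("median, raw scale:  %.3f (%.3f, %.3f)" % raw)

# (b) Bootstrap the mean of the logged values, then exponentiate.
logged = [math.log(v) for v in serum]
log_scale = percentile_bootstrap(logged, average, n_boot=25_000, seed=1)
back = tuple(math.exp(v) for v in log_scale)
print("mean, log scale:    %.3f (%.3f, %.3f)" % log_scale)
print("back-transformed:   %.3f (%.3f, %.3f)" % back)
```

With an odd sample size, each bootstrap median is one of the observed values, which is why the interval limits in (a) tend to coincide with values in the data.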

Example 4: Difference between two groups

In Chapter 7 (paired data) the temperature changes of 12 patients undergoing an operation were given. In the section on non-parametric data a further set of 10 patients who had not undergone surgery, but who had their temperature changes recorded, were given. The aim was to see whether surgery was associated with temperature change. The average changes in the two groups were -1.53 and -0.18 respectively, giving a difference of 1.35 with 95% ci (-0.63, 3.34), assuming the parametric requirements are valid.

Dotplot Example 4

The bootstrap samples now consist of 12 taken with replacement from the operated group and 10 taken with replacement from the unoperated group, with the difference in means of these groups recorded for each pair of samples. This process is replicated 25,000 times to give 25,000 bootstrap sample estimates of the difference in mean change of the groups:

Histogram Example 4

The estimated difference in mean change from the bootstrap approach is 1.35 with 95% ci (-0.505, 3.212).

If we preferred to use medians and take the difference in those (perhaps being unsure that means were valid), then 20,000 bootstrap estimates give an estimated median difference of 0.8 with 95% ci (-1.25, 3.00).
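A two-group bootstrap resamples each group separately, keeping the original group sizes. The sketch below shows the structure only. The temperature changes from Chapter 7 are not reproduced in this chapter, so the two lists are made-up placeholder numbers and the printed results are NOT the chapter's results. The function name `bootstrap_difference` is also an assumption.

```python
import random
from statistics import mean, median, quantiles

def bootstrap_difference(group_a, group_b, statistic, n_boot=20_000, seed=0):
    """Percentile bootstrap for statistic(group_b) - statistic(group_a).

    Each group is resampled with replacement at its own original size.
    Returns (50th, 2.5th, 97.5th) centiles of the bootstrap differences.
    """
    rng = random.Random(seed)
    diffs = []
    for _ in range(n_boot):
        a = rng.choices(group_a, k=len(group_a))
        b = rng.choices(group_b, k=len(group_b))
        diffs.append(statistic(b) - statistic(a))
    cuts = quantiles(diffs, n=40)          # cut points at 2.5%, ..., 97.5%
    return median(diffs), cuts[0], cuts[-1]

# PLACEHOLDER values for illustration only (not the Chapter 7 data):
# 12 "operated" and 10 "unoperated" temperature changes.
operated_placeholder = [-3.1, -2.4, -0.2, -4.0, 0.5, -1.8,
                        -2.9, 1.1, -0.6, -3.3, -1.0, -0.7]
unoperated_placeholder = [0.4, -1.2, 0.9, -0.3, -2.0, 1.5, 0.1, -0.8, 0.6, -1.0]

for name, stat in (("mean", mean), ("median", median)):
    est, lower, upper = bootstrap_difference(
        operated_placeholder, unoperated_placeholder, stat)
    print(f"difference in {name}s: {est:.2f} ({lower:.2f}, {upper:.2f})")
```

Swapping `mean` for `median` is the only change needed to move from a difference in means to a difference in medians.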


Statistical inference generally assumes that the only information you have about the population is from the sample obtained. Most randomly selected samples will have a distribution that reflects that of the population. With bootstrapping we exploit this to extract as much information as possible from the values that we know are in the population (since they are in the sample) and from the best information we have on the population distribution of the values. This is comparable with using the sample estimates (mean, sd, proportion etc.) to estimate the standard error, and using that to construct a confidence interval around the parameter estimate (mean, difference in means, percentage etc.) from that particular sample (which is our best available estimate of the population parameter). The advantage with bootstrapping is that no parametric distributional assumptions are needed. There remains the possibility that our sample, although random, gives an extreme and therefore unrepresentative estimate. We accept that 5% of the confidence intervals obtained will not actually contain the true population value, but this is a limitation whatever analytic techniques are used.

If parametric assumptions can be made, then these may enhance the analyses. We saw in the examples that the bootstrap estimation gave results that were comparable (and generally near identical) to estimates incorporating some distributional assumptions. Where the data were skewed, following a lognormal distribution, more accurate estimates were obtained via both methods when log-normality was assumed and incorporated into the analyses. However, we would often not be in a situation where we could make such a decision with certainty from the data available, although on occasion we may be able to draw on other information.

The bootstrap examples given were relatively simple and based on standard analytic scenarios. The techniques can however be applied much more widely and in theory any level of complexity can be built in. The sampling framework and analysis need to be structured as per the study, to address whatever specified research question is mooted. 

Multilevel structures and adjustments for covariates can be readily accommodated into the process.

In the examples given, the number of bootstrap samples taken varied between 10,000 and 25,000. There is no hard and fast rule as to how many is enough. For most applications, computation is fast and so there is no reason not to generate large numbers of samples. The number needs to be large enough that results do not fluctuate greatly from one run to the next. If computation is a problem for some reason, then running the whole process a few times with the chosen (or feasible) number of samples, and checking that the results do not materially change, can give reassurance.
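One way to carry out the check just described is to repeat the bootstrap with several different random seeds and look at how much the interval limits move. The sketch below uses the ten values from the start of the chapter; the function names and the particular numbers of samples compared are illustrative choices.

```python
import random
from statistics import quantiles

data = [5.4, 2.3, 4.7, 32, 0.3, -0.1, 22.2, -7.6, 1.2, 1.4]

def percentile_interval(values, n_boot, seed):
    """95% percentile bootstrap interval for the mean, for one seed."""
    rng = random.Random(seed)
    n = len(values)
    estimates = [sum(rng.choices(values, k=n)) / n for _ in range(n_boot)]
    cuts = quantiles(estimates, n=40)      # cut points at 2.5%, ..., 97.5%
    return cuts[0], cuts[-1]

def run_to_run_spread(values, n_boot, n_runs=5):
    """Range of the lower and upper limits across repeated runs."""
    intervals = [percentile_interval(values, n_boot, seed) for seed in range(n_runs)]
    lowers = [low for low, _ in intervals]
    uppers = [high for _, high in intervals]
    return max(lowers) - min(lowers), max(uppers) - min(uppers)

for n_boot in (200, 10_000):
    spread_lower, spread_upper = run_to_run_spread(data, n_boot)
    print(f"{n_boot:>6} samples: lower limit varies by {spread_lower:.2f}, "
          f"upper limit by {spread_upper:.2f}")
```

If the limits vary by more than is acceptable for the purpose at hand, the number of bootstrap samples should be increased.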

In summary, bootstrapping is a highly flexible and valid technique for constructing a confidence interval around any sample estimate of a population parameter. It is available in most statistical packages or can be programmed relatively simply and tailored to specific scenarios. This chapter has given examples of usage and shown how these simple non-parametric intervals accord well with those based on theoretical distributions.