With small samples it is possible to relate all of the information that has been gathered regarding the variables of interest. For larger samples, frequency distributions help to relate the sample information. When variables are continuous, data are sometimes grouped, but this loses information and complicates comparisons between several levels of another variable e.g. blood pressure in different ethnic groups, or at different stages of pregnancy.
Whilst not wanting to detract from the information held in the collected data sample, we may want to summarise its important features to make them easier to disseminate. Ways of describing or summarising the data are called descriptive statistics. They aim to give the relevant and useful information without losing any features of importance. Perhaps in an ideal world, all potential users of the information would have the time and ability to take the raw data and draw their own independent conclusions, thereby avoiding being at the mercy of someone else's choice of analysis. In practice, there is usually only the time or space to give or consume limited information, and hence it is important that the descriptions given are accurate and that we fully understand what they are and their strengths and weaknesses.
The appropriate summaries to use depend on whether the data is categorical or numeric. For numeric data there are several different potential summaries and the most appropriate depends on the distribution of the data.
- Summarising Categorical Data
When data is categorical the values recorded on a group of individuals (or items) can be summarised as the proportions (or percentages) of the total falling into each category.
Note: A proportion and a percentage are different ways of expressing the same information. Neither is better than the other; it does not matter which is used, but it is important to be consistent.
The proportion is calculated by taking the number in the category divided by the total. The percentage is obtained from multiplying the proportion by 100.
Rates can also be used; these express the number for some denominator which may relate to people or time. For example, how many per 100 or 1000 individuals or how many times the outcome was observed per year of follow up.
Odds are another alternative: these are the number in the category divided by the number not in the category. When the event is rare, the odds and the risk (the proportion in the category) are very similar.
Ref: Lucassen P.L.B.J.; Assendelft W.J.J.; van Eijk J.T.M.; Gubbels J.W.; Douwes A.C.; van Geldrop W.J. Systematic review of the occurrence of infantile colic in the community. Archives of Disease in Childhood, May 2001, vol 84, no. 5, pp. 398-403(6).
In one hospital based study of new-borns followed from birth to 6 months, 68 out of 360 children had infantile colic (as defined by unexplained paroxysms of crying or fussing for at least 90 minutes each day for a minimum of 6 of the preceding 7 days, or periods of 3 hours a day in 3 of the preceding 7 days).
From this sample the proportion of infants with colic is 68/360 = 0.189 and the equivalent percentage is 18.9%.
The rate of colic is 0.189 per child, or 18.9 per 100, or 189 per 1000.
The odds of colic are 68/(360-68) = 68/292 = 0.233.
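The four summaries above can be reproduced with a few lines of Python. This is a minimal sketch; the function name `summarise_category` and the choice of a rate per 1000 are illustrative assumptions, not part of the original study.

```python
def summarise_category(count, total):
    """Summarise one category as a proportion, percentage, rate and odds."""
    proportion = count / total
    return {
        "proportion": proportion,
        "percentage": 100 * proportion,        # proportion multiplied by 100
        "rate_per_1000": 1000 * proportion,    # same information per 1000 individuals
        "odds": count / (total - count),       # in the category / not in the category
    }

# Infantile colic example: 68 of 360 children
colic = summarise_category(68, 360)
for name, value in colic.items():
    print(f"{name}: {value:.3f}")
```

Note that the odds use a different denominator (those without colic) from the proportion (all children), which is why the two only agree closely when the event is rare.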
Various characteristics recorded as categorical variables are shown in the table below. Not all variables were recorded for all individuals, hence the difference in n (which takes values 771, 772 or 773) between the variables.
- Summarising Numeric Data
There are several ways of summarizing numeric data and the appropriate summary will depend on the distribution of the data. To illustrate the differences, the following example data is used:
Ref: The European Collaborative Study. Children born to women with HIV-1 infection: natural history and risk of transmission. Lancet, 1991; 337: 253-260
CD4 counts, which are a measure of immunity obtained from blood samples, were available from 179 babies enrolled into the European Collaborative Study of Mother to child HIV transmission. The CD4 counts were divided into 3 groups according to whether they were:
(i) Drug-withdrawn at birth (60 children)
(ii) Children of mothers who used intravenous drugs during pregnancy but who were not themselves drug withdrawn (63 children)
(iii) Children of mothers who did not use any intravenous drugs during pregnancy i.e. ex and never drug users (56 children)
CD4 count is a continuous numeric measurement. For the purpose of displaying the distributions the values are grouped and shown in bar charts opposite.
(Note that dot-plots could have been used to show the exact values, but the groupings here are refined enough to show the general patterns).
The first chart shows that 1 child of the 60 had a CD4 count between zero and 0.5, a further 2 had values between 0.5 and 1, …, 1 had a value between 7.5 and 8, and the child with the highest CD4 count had a value between 10 and 10.5.
i) Children who were drug withdrawn at birth - 60 children
ii) Children of mothers who used intravenous drugs during pregnancy but who were not themselves drug withdrawn - 63 children
iii) Children of mothers who did not use any intravenous drugs during pregnancy (ex and never drug users) - 56 children
We can see that CD4 count appears to be higher in the first group, the children who were drug-withdrawn at birth. In order to quantify this difference we could use either the mean or the median. These are described below.
Possibly the most widely known measure of centre or average is the MEAN, sometimes known as the arithmetic mean.
The mean is calculated by adding together all of the measurements and dividing by the total number of measurements.
For example, there are 60 children who were drug withdrawn at birth. To calculate their mean we add together the 60 CD4 measurements from these children and divide by 60. This yields a mean of 3.256 for this group.
To find the mean of the second group we add together their individual CD4 counts (63 measurements) and divide by 63 (the number in the sample). This yields a mean of 2.668 for this group.
In group 3, where there are 56 children, the total of the 56 measurements is 129.69 and the mean is therefore 129.69 divided by 56 = 2.316.
Most packages (and hand held calculators) have a facility for automatically calculating the mean of a set of values. Within the web link there is an excel spreadsheet which can be used to calculate the mean for a sample of up to 500 values.
Another common measure of centre or average is the MEDIAN. This is the value that falls halfway along the frequency distribution, the 'middle value'. If the data are sorted into order, from smallest to largest, the median is the number such that 50% of the values are higher than this number and 50% are lower. It is sometimes known as the 50th centile.
If there is an odd number of values, then the median will take one of those values. For example, if there are 21 values then the 11th highest value will have 10 values less than it and 10 greater than it. Hence the 11th highest value of a set consisting of 21 values will be the median.
If there is an even number of values, then the median will lie halfway between two of those values. For example, if there are 20 values then the median will lie between the 10th and 11th highest values, since 10 measurements are less than this and 10 are greater. Hence the mean of the 10th and 11th highest values in a set consisting of 20 values will be the median. (Add together the 10th and 11th highest values and divide by 2).
Medians for samples of up to 500 measurements can also be calculated using the excel spreadsheet on the web link.
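The odd and even cases described above can be sketched in Python as follows. The function `median_by_hand` and the two small samples are hypothetical illustrations; the standard library's `statistics.median` applies the same rule and is used here as a cross-check.

```python
import statistics

def median_by_hand(values):
    """Median as the middle value (odd n) or the mean of the two middle values (even n)."""
    ordered = sorted(values)          # sort from smallest to largest
    n = len(ordered)
    middle = n // 2
    if n % 2 == 1:
        return ordered[middle]        # odd n: the single middle value
    # even n: average of the two values either side of the halfway point
    return (ordered[middle - 1] + ordered[middle]) / 2

odd_sample = [5, 1, 9, 3, 7]          # hypothetical values, n = 5
even_sample = [5, 1, 9, 3, 7, 11]     # hypothetical values, n = 6

print(median_by_hand(odd_sample), statistics.median(odd_sample))
print(median_by_hand(even_sample), statistics.median(even_sample))
```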
CD4 count example:
The median of the 60 measurements in group (i) is the average of the 30th and 31st values in order of magnitude. The 30th highest CD4 count is 2.9, the next highest (the 31st) is 2.99. Hence the median is the average of these two values = (2.9+2.99)/2 = 5.89/2 = 2.945
The median of the 2nd group, which consists of 63 children's measurements, is the 32nd highest CD4 count obtained. There will be 31 CD4 counts lower than this and 31 higher than this in the sample. The 32nd highest value is 2.55 and hence this is the median for this group.
As there are 56 children in the 3rd group, the median for this group will be halfway between the 28th and 29th highest values (i.e. the average of these two CD4 counts). The 28th highest value obtained was 2.31 and the 29th was 2.33, hence the median = (2.31+2.33)/2 = 2.32
Comparing the mean and the median
Because the mean is calculated by adding together all of the values and then averaging these, a single extreme value can be quite influential.
For example, suppose that there are 10 hospitals, and 9 of these hospitals each admit 10 patients with severe asthma per year and the final (10th) hospital admits 500 such patients. The mean number of admissions will be (10+10+10+10+10+10+10+10+10+500)/10 = 59 patients per year.
The number of admissions in the 10th (outlying) hospital has a relatively large influence on the mean. The mean of this group of hospitals does not represent any of the hospital admission patterns closely. This is because the distribution of admission is skew or asymmetric.
A skew distribution is a distribution in which the measurements 'tail-off' unevenly in one direction. The converse type of distribution is symmetric where the measurements are 'mirror-imaged' either side of the central (or mean) value.
A distribution of values can be skewed in either direction; values above the median can be more spread than those below or vice-versa. To differentiate between these two directions of skewing, skew distributions are sometimes named according to the direction in which the tail of the distribution (or slower tapering or more spread out side) lies.
For example, the skewed distribution shown on the left below might be termed as negative, left or downward skew because this is the direction of the tail of the distribution. Similarly the distribution shown on the right tails off to the right and would be called either positive or right or upward skew.
When the distribution of a set of data is skew, the mean will be nearer to the tail of the distribution than the median. For example, the numbers of annual average hospital admissions in the 10 hospitals detailed above had a single outlying large value (500 per year from the 10th hospital). The 'tail' is therefore to the right and the distribution is right-skew. The mean value (59 admissions per year) is greater than the median (10 per year).
The greater the skewing, the larger the difference between the mean and the median. For example, the difference in the hospital example was 59-10 = 49 per year. If the 10th hospital was less extreme and had 50 admissions on average rather than 500 then the mean of the 10 hospitals would be 14 per year, the median would remain at 10, so the difference between the mean and the median would be smaller (14-10 = 4).
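The hospital admissions example can be checked directly. The sketch below uses only the admission figures given in the text; the variable names are illustrative.

```python
import statistics

# Nine hospitals admitting 10 patients per year and one outlying hospital
admissions = [10] * 9 + [500]
# The less extreme scenario: the 10th hospital admits 50 rather than 500
less_extreme = [10] * 9 + [50]

for label, sample in (("outlier = 500", admissions), ("outlier = 50", less_extreme)):
    mean = statistics.mean(sample)
    median = statistics.median(sample)
    # The gap between mean and median grows with the size of the outlier
    print(f"{label}: mean = {mean}, median = {median}, difference = {mean - median}")
```

Only the mean moves when the outlying hospital changes; the median stays at the value shared by the other nine hospitals.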
The mean and median will be the same if the distribution is symmetric.
CD4 count example:
Group (i): Children who were drug withdrawn at birth
Mean = 3.256, median = 2.945, difference = 3.256-2.945 = 0.311
Group (ii): Children of mothers who used intravenous drugs during pregnancy but who were not themselves drug withdrawn
Mean = 2.668, median = 2.55, difference = 2.668-2.55 = 0.118
Group (iii): Children of mothers who did not use any intravenous drugs during pregnancy (ex and never drug users)
Mean = 2.316, median = 2.32, difference = 2.316-2.32 = -0.004
If we look at the bar charts of the measurements we can see that the 1st group is the most skew with a few babies having relatively large CD4 counts and that the 3rd group is the most symmetric.
In groups (i) and (ii) the mean is greater than the median, indicating a tendency towards positive skew amongst the CD4 counts; in group (iii) the mean (2.316) and median (2.32) are almost identical. The difference appears to increase as the drug involvement is greater (i.e., the largest difference of 0.311 is found amongst those babies who were drug withdrawn at birth and hence had the greatest drug exposure; the smallest difference, only 0.004 in size, was amongst those with the least drug exposure). The greater skewing for higher levels of drug involvement is as anticipated: when exposed to drugs some babies will show a greater reaction than others, and some may not be greatly affected. This variety of responses to the drug exposure leads to a skew distribution of values. For similar reasons, we sometimes observe skew distributions of measurements amongst diseased individuals or those exposed to powerful treatments.
Which to use: Mean or Median?
We use measures of centre to summarise the data and to impart information in a concise way about distributions. For example, we do not wish to relate all the collected CD4 counts to everyone, but we may wish to report that CD4 count tends to be higher amongst children who are drug withdrawn at birth (in this population).
The measure of centre to use, in a given situation, is the one which best summarises the data. For example, the mean of group (i) in the example used is not an adequate summary of that group since it is highly influenced by one or two 'odd' high values. As an overall measure of centre of the group (i), the median is probably a better choice.
The median is always representative of the centre of the data. The mean is only representative if the distribution of the data is symmetric, otherwise it may be heavily influenced by outlying measurements.
The median is not very sensitive to changes in the data. For example, the hospital admissions for severe asthma example introduced earlier consisted of the following data:
10, 10, 10, 10, 10, 10, 10, 10, 10, 500 on average per year within each of the 10 hospitals sampled.
The median value of 10 would be unaffected if the sample had been:
1, 1, 1, 1, 10, 10, 1000, 1000, 1000, 1000
This is very different; we would not want to interpret both of these samples of admission data in the same way, but the medians make no distinction.
The mean is very sensitive to changes in the data. Because each measure is directly involved in the calculation of the mean, the mean will be affected by a single change in any of the data values.
Because the mean is more sensitive to changes than the median, it is a more powerful summary measure when it can validly be used.
The inferences that can be made from a sample will be more precise if distributional assumptions which validate the mean can be made (i.e. the distribution is symmetric and hence the mean is a valid measure of centre). However, these assumptions cannot always be made and the mean may give a misleading idea of the data. If the assumptions cannot be made, then the median is a better measure of centre that we know will be representative regardless of the distribution of the measurements.
Measures of spread
With categorical data, the proportion or percentage quantified the distribution completely. For numeric data there is another aspect that must be considered. To illustrate this point, consider the following distributions of 100 measurements, both of which have means and medians equal to zero:
The first variable takes values between -2.55 and 2.26, the second between -21.01 and 26.73. Clearly the means and/or medians do not fully quantify the differences between the distributions.
The second distribution clearly covers a wider range of values and we also need to quantify this. Similarly, the summary statistics (mean and median) show that CD4 count appears to be lowest on average amongst the group (iii). In addition to the differences in overall average, those with greater drug involvement covered a wider range of values. Having summarised the differences in the levels (or centres) of the groups we may wish to summarise the differences in the spread of the values.
The most common measures used to summarise the spread of a distribution are the range, the inter-quartile range and the standard deviation.
The range
This is the difference between the largest and the smallest values of the distribution.
In group (i) the largest CD4 count recorded amongst the 60 children was 10.19 and the smallest 0.39, hence the range = 10.19 - 0.39 = 9.80. The range for the 63 children in group (ii) = 7.13 - 0.41 = 6.72 and for the children in group (iii), the range = 4.77 - 0.33 = 4.44.
Examination of the frequency distributions indicates that group (i) has the most spread out values and group (iii) the least. The ranges as given above quantify these differences.
One problem with the range is that it ignores the bulk of the data, depending by definition only on the two most extreme (and hence possibly 'odd') values.
The child in group (i) with a CD4 count 10.19 has quite an odd (or extreme) value; the next 3 highest values in that group being 7.99, 7.49 and 5.20. The 'odd value', 10.19, has had a very large effect on the size of the range. If this child had not been sampled, the range would fall to 7.99 - 0.39 = 7.6, a difference of 2.2 (9.8 - 7.6).
The interquartile range
To avoid the false impression that the range may give, we can instead quote a 50% central range. This is the range within which 50% of the sample values lie. To calculate this requires the use of percentiles (sometimes abbreviated to centiles).
A percentile is the value below which a certain percentage of the values in the sample lie. For example, the 20th percentile is the value below which 20% of the observations lie.
The 50th percentile is the value below which 50% of the observations lie. We have already given the 50th percentile a special name: the median.
To find the 50% central range, we need to calculate the 25th and 75th percentiles. The 25% smallest values in the sample will lie below the 25th percentile, and the 25% highest values in the sample will lie above the 75th percentile. The interquartile range is the difference between the 75th and 25th centiles.
A more detailed explanation of the calculation of centiles is given on the web link and there is also an excel spreadsheet for calculating centiles from a set of values.
The interquartile range does not depend on the oddest or extreme values like the range does and is therefore a more stable summary measure. The interquartile range quantifies the range within which half of all measured values lie and is hence more representative of the majority than the range.
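A short Python sketch illustrates why the interquartile range is more stable than the range. The sample values are hypothetical. Conventions for interpolating centiles differ between software packages; the default method of `statistics.quantiles` places the 25th centile at position (n+1)/4 in the ordered sample, which is an assumption about the convention intended here.

```python
import statistics

def data_range(values):
    """Difference between the largest and smallest values."""
    return max(values) - min(values)

def interquartile_range(values):
    """Difference between the 75th and 25th centiles."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # three quartile cut points
    return q3 - q1

sample = list(range(1, 12))          # hypothetical values 1, 2, ..., 11
with_outlier = sample[:-1] + [100]   # same sample with the largest value replaced by 100

for label, data in (("original", sample), ("with outlier", with_outlier)):
    print(f"{label}: range = {data_range(data)}, IQR = {interquartile_range(data)}")
```

Replacing the largest value changes the range substantially but leaves the interquartile range untouched, because the quartiles depend only on the central part of the ordered sample.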
CD4 count example:
There are 60 counts in group (i). The 25th centile lies between the 15th and 16th highest values and can be calculated to be 2.185; the 75th centile lies between the 45th and 46th highest values and can be calculated to be 4.02. Hence the interquartile range = 4.02 - 2.185 = 1.835
The interquartile ranges for groups (ii) and (iii) are 1.72 and 1.31 respectively.
Comparing the interquartile ranges shows that the children without drug involvement (group (iii)) are less varied in their responses.
The standard deviation
Both the range and the interquartile range are relatively insensitive to changes in the data. They make no distributional assumptions and are comparable to the median in this respect.
When measures of centre were considered, the mean was found to be highly sensitive to data changes and to make major distributional assumptions. It was found that the inferences that could be made from a sample would be greater/more precise/more accurate if distributional assumptions which validated the mean could be made (i.e., the distribution is symmetric and hence the mean is a valid measure of centre). A measure of spread that corresponds to the mean (i.e., assumes symmetry and is very sensitive to changes in the data) is the standard deviation.
The standard deviations of groups (i), (ii) and (iii) are 1.72, 1.35 and 1.00; this again shows the increasing variability of measurements with greater drug involvement.
There are 5 stages to calculate the standard deviation (this section can be omitted if preferred):
1) The difference between each value and the mean is calculated (so for our sample of 60 drug withdrawn babies there will be 60 differences, one between each individual child and the mean of all 60, i.e. 3.256). If the values are very spread out there will be many large differences; if values are tightly grouped then all differences will be small.
2) Some measurements will be greater than the mean and some less (by definition). Hence if we take each value minus the mean, some differences will be positive and some negative. The sign (positive or negative) of the differences is not important, we are interested in how far values are from the mean, not necessarily in which direction. Hence the next stage is to square all differences to make them all positive. If the values are very spread out then there will be many large squared differences, if they are tightly grouped then the squared differences will be relatively small.
3) The squared differences are added together. If the values are very spread then this will yield a large number, if the values are tightly grouped (hence we are adding together relatively small squared differences), then the sum of the squared differences will be quite small.
4) The summary measure of spread may be used to compare different groups (for example the 3 different groups of babies in the CD4 count example). The summary measure should not be larger in one group than another purely because more squared differences have been added together. For example, if we were to compare a set of 20 diseased patients with very variable measurements with 2000 control children with all values within a very tight range, then the total of the 2000 squared differences in the control group may be larger than the sum for the 20 diseased patients even though the measurements are less variable.
Hence, the dependence of the sum of the squared differences on the sample size is removed by dividing the total by the number in that sample minus 1 (i.e., divided by 19 for the diseased patients introduced in the previous paragraph and by 1999 for the 2000 control subjects).
The quantity obtained is known as the variance.
5) Whilst the variance works well as an overall summary of spread in certain circumstances it is based on squared measurements. Generally we prefer to use the square root of the variance, which is known as the standard deviation.
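The five stages can be followed step by step in code. This is a minimal sketch using a small hypothetical sample; the function name `standard_deviation` is illustrative, and the result is compared with the standard library's `statistics.stdev`, which also divides by the sample size minus 1.

```python
import math
import statistics

def standard_deviation(values):
    """Sample standard deviation, following the five stages in the text."""
    n = len(values)
    mean = sum(values) / n
    differences = [v - mean for v in values]    # stage 1: value minus mean
    squared = [d ** 2 for d in differences]     # stage 2: square to remove the sign
    total = sum(squared)                        # stage 3: add the squared differences
    variance = total / (n - 1)                  # stage 4: divide by (sample size - 1)
    return math.sqrt(variance)                  # stage 5: square root of the variance

sample = [2, 4, 4, 4, 5, 5, 7, 9]               # hypothetical measurements
print(round(standard_deviation(sample), 3))
print(round(statistics.stdev(sample), 3))       # library value for comparison
```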
Comparing the range, interquartile range and standard deviation
As stated earlier in this chapter the range and the interquartile range are much less sensitive to changes in the data than the standard deviation. If a single value changes then the standard deviation, by definition, will also change. Hence the standard deviation is a more powerful summary measure as it makes more comprehensive use of the entire dataset. However, situations when the mean might not be an appropriate measure of centre were discussed previously. If the mean is not a meaningful summary of the centre of the data, then it follows that the standard deviation, which is calculated from distances around the mean, will not be a useful summary of the spread of the values.
Therefore, if distributional assumptions (data is symmetric) can be made and there are adequate numbers in the sample to check those assumptions (as a rule of thumb it is often said that a sample size of at least 20 would be adequate), then the mean and standard deviation should be used to quantify the centre and spread of the measurements.
Alternatively, if the data distribution is skew and/or the sample size is small then it is preferable to use the median and interquartile range to summarise the measurements.
- The Normal Distribution
Some shapes of frequency distribution have special names since they are quite common. The best known and most useful is called the normal distribution.
The standard deviation is a particularly useful measure if the data is normally distributed.
The normal distribution is symmetric and bell-shaped:
Many variables follow this distribution
Fig: A distribution of heights in young adult males with an approximating normal distribution (Martin, 1949, Table 17(Grade 1)).
Fig: A distribution of diastolic blood pressures of schoolboys with an approximating normal distribution (Rose, 1962, Table 1)
Log (serum bilirubin) values in 216 patients with primary biliary cirrhosis:
The normal distribution is by definition symmetric with most values towards the centre and with values tailing off evenly in either direction. Because of the symmetry of the distribution, the mean always lies in the centre of the distribution (where the peak is). The standard deviation of the distribution tells us how spread the values are around the mean.
As the mean and standard deviation change, the distribution may alter its position on the horizontal axis, or become 'taller and thinner' or 'shorter and fatter':
Effect of reducing mean:
Effect of decreasing standard deviation:
In a normal distribution, 95% of the values lie within 1.96 standard deviations either side of the mean, and 68% lie within 1 standard deviation either side.
Since 1.96 is so close to 2, it is common for ease of calculation to construct the interval ± 2 standard deviations either side of the mean and this will contain approximately 95% of the values.
Suppose the heights of 5 year old children have a mean 100cm and standard deviation 10cm and that the heights are known to be normally distributed. Approximately 95% of the values will lie in the interval ± 2 standard deviations (=2 x 10 = 20cm) either side of the mean (100cm). This interval is (100 ± 20) = (80, 120 cm). Approximately 95% (or 19 out of every 20) of 5 year olds lie within this height range. The remaining 5% (or 1 in 20) will be either shorter than 80cm or taller than 120cm. The symmetry of the distribution means that 2.5% (1 in 40) will be shorter than 80cm and the other 2.5% (1 in 40) will be taller than 120cm.
These calculations are based on the Normal Table (see the next section).
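The height example can be checked with Python's standard library, which provides a `NormalDist` class. This sketch assumes, as the example does, that heights are exactly normally distributed with mean 100cm and standard deviation 10cm.

```python
from statistics import NormalDist

heights = NormalDist(mu=100, sigma=10)     # heights of 5 year olds, in cm

lower = 100 - 2 * 10                       # mean minus 2 standard deviations
upper = 100 + 2 * 10                       # mean plus 2 standard deviations

within = heights.cdf(upper) - heights.cdf(lower)   # proportion between 80 and 120 cm
below = heights.cdf(lower)                         # proportion shorter than 80 cm
above = 1 - heights.cdf(upper)                     # proportion taller than 120 cm

print(f"within ({lower}, {upper}) cm: {within:.4f}")
print(f"below {lower} cm: {below:.4f}, above {upper} cm: {above:.4f}")
```

The proportion within 2 standard deviations is slightly more than 95%, because 2 is slightly larger than 1.96.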
- Normal Tables
It is possible to quantify the exact percentage of values that lie within a selected number of standard deviations either side of the mean. The range is from 0% at 0 standard deviations either side up to 100% for an infinite number of standard deviations either side. The greater the number of standard deviations, the greater the percentage contained. The percentage contained can be automatically calculated using the excel spreadsheet within the web link.
The table below shows the percentages contained within intervals for selected numbers of standard deviations either side of the mean. The table also gives the proportions excluded on each side. The last two columns are readily calculated from the information given in the '% contained in interval column'. For example, the table shows that 99% of the values lie within the interval (mean ± 2.58 standard deviations), this implies that 1% is outside this interval, and this will be 0.5% either side. In the table the 1% and 0.5% are expressed as proportions (0.01 and 0.005 respectively).
NOTE: At this stage p-values have not been introduced and the headings for the last two columns ('1-sided p-value' and '2-sided p-value') may be meaningless. The table given above will be referred back to in later chapters and these headings should then be useful. The spreadsheet linked below can be used to obtain these values.
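Entries of a normal table of this kind can be generated with the standard normal distribution in Python. The sketch below is not the original table: the particular numbers of standard deviations listed and the function name `table_row` are assumptions chosen for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist()   # mean 0, standard deviation 1

def table_row(z):
    """Return (% contained within mean ± z s.d., proportion excluded on one side,
    proportion excluded on both sides)."""
    one_sided = 1 - std_normal.cdf(z)      # proportion above mean + z s.d.
    two_sided = 2 * one_sided              # by symmetry, the same again below
    contained = 100 * (1 - two_sided)
    return contained, one_sided, two_sided

print("s.d.   % contained   one side   both sides")
for z in (1.0, 1.96, 2.0, 2.58, 3.09):     # illustrative selection of values
    contained, one_sided, two_sided = table_row(z)
    print(f"{z:<6} {contained:>10.2f}   {one_sided:.4f}     {two_sided:.4f}")
```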
Using the normal table - examples
(1) 99.8% of the values of a normal distribution lie in the interval (mean ± 3.09 s.d.). 0.001 or 0.1% of the values are greater than the value (mean + 3.09 s.d.). A further 0.001 or 0.1% of the values are less than (mean - 3.09 s.d.). In total, 0.002 (0.001+0.001) or 0.2% of the values lie further than 3.09 s.d. away from the mean.
The probability of randomly choosing a value at least 3.09 standard deviations above the mean is 0.001. Similarly, the probability of choosing a value at least 3.09 standard deviations below the mean is also 0.001. The probability of choosing a value at least 3.09 standard deviations away from the mean is 0.002.
(2) The diastolic blood pressure of schoolboys is normally distributed with a mean of 58.5 mmHg and standard deviation 9.7 mmHg. The table tells us that 0.005 of the distribution values are greater than the mean + 2.58 s.d., i.e. 0.005 of the schoolboys will have diastolic blood pressures greater than 58.5 + (2.58 x 9.7) = 83.53 mmHg. Similarly 0.005 of the schoolboys will have diastolic blood pressures less than 58.5 - (2.58 x 9.7) = 33.47 mmHg. So:
(i) 0.99 (= 1-0.005-0.005), or 99%, of the boys will have blood pressures in the range (33.47, 83.53 mmHg).
(ii) The probability that a schoolboy chosen at random has a blood pressure greater than 83.53 mmHg is 0.005. The probability that a schoolboy chosen at random has a blood pressure more than 25.03 (= 2.58 x 9.7) mmHg away from the mean, 58.5 mmHg, is 0.01 (= 0.005 + 0.005).
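The blood pressure example can be reproduced in the same way. This sketch takes the mean, standard deviation and the multiplier 2.58 from the text and assumes exact normality; the variable names are illustrative.

```python
from statistics import NormalDist

mean_bp, sd_bp = 58.5, 9.7      # mmHg, from the schoolboy example
z = 2.58                        # multiplier read from the normal table

lower = mean_bp - z * sd_bp     # limit below which about 0.005 of boys fall
upper = mean_bp + z * sd_bp     # limit above which about 0.005 of boys fall

bp = NormalDist(mu=mean_bp, sigma=sd_bp)
inside = bp.cdf(upper) - bp.cdf(lower)    # proportion inside the interval
above = 1 - bp.cdf(upper)                 # proportion above the upper limit

print(f"interval: ({lower:.2f}, {upper:.2f}) mmHg")
print(f"proportion inside: {inside:.3f}, proportion above: {above:.4f}")
```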
Notice that exactly 95% of the values in a normal distribution lie in the interval ± 1.96 standard deviations either side of the mean. Approximately 95% lie in the interval ± 2 standard deviations.
If a variable is known to be normally distributed then we only need to be given the mean and standard deviation of the distribution to have a complete picture of what values of the variable exist and their relative frequencies.
Faced with a particular distribution we may not know whether or not normality can be assumed. In the next section, the assessment of whether or not a distribution of values is normally distributed is discussed followed by a section on the transformation of non-normal distributions to normality.
- Non-Normal Data
There are a variety of statistical tests available to formally test normality.
However, it is NEVER possible to prove that a variable is normally distributed, only to show that the sample data is compatible with normality. The formal tests available answer the question:
"Could this data have come from a normal distribution?"
Rather than what we wanted to know, which was…
"Does this data come from a normal distribution?"
- The data may also be compatible with other, distinctly non-normal, distributional shapes.
- Formal tests of normality depend more on sample size than distributional shape.
- Simpler methods are to be preferred.
Simple methods are:
i) To examine a bar chart, or histogram, of the data.
ii) Use the mean and standard deviation of the data values to construct the interval within which 95% of the values would be expected to lie if the data were normally distributed (i.e., the interval (mean ± 1.96 standard deviations)). This interval should exclude approximately 2.5% of the sample values at either side. If it does not, and/or if the interval has limits that are unfeasible (for example negative ages), then this implies that the data is non-normal.
For example: Shown is the bar chart of the serum bilirubin measurements in 216 patients with primary biliary cirrhosis:
The distribution is distinctly upwardly skew. Putting all 216 values into an appropriate statistical package, the mean value is calculated as 56.43 and the standard deviation as 64.018.
If the data were normally distributed then the interval bounded by (mean ± 1.96 sd) would contain about 95% of the sample values. ('About' because there may be some sampling variability - there would be exactly 95% within this interval in the population but a sample of 216 might yield slightly less or slightly more.)
The interval limits are calculated as:
Mean - 1.96sd = -69.045 and Mean + 1.96sd = 181.905
If the data were normally distributed we would expect approximately 2.5% (or 5.4) of the values to be less than -69.045 and a similar number to be greater than 181.905. Most of the values should lie in the interval (-69.045, 181.905). No values can be lower than zero and there are more than 6 with values above the upper limit. The lower limit of -69.045 is biologically unfeasible. These findings indicate that the data is non-normal and, in fact, upwardly skew.
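This simple check is easy to automate when only the summary statistics are available. The sketch below uses the bilirubin mean, standard deviation and sample size from the text; the function name `normal_range_check` and its `lower_bound` parameter are hypothetical.

```python
def normal_range_check(mean, sd, n, lower_bound=None):
    """Interval expected to hold 95% of values under normality, with a feasibility flag.

    lower_bound is the smallest value the variable can physically take, if any.
    """
    low = mean - 1.96 * sd
    high = mean + 1.96 * sd
    expected_each_tail = 0.025 * n      # values expected beyond each limit
    feasible = lower_bound is None or low >= lower_bound
    return low, high, expected_each_tail, feasible

# Serum bilirubin example: 216 patients, values cannot be below zero
low, high, expected, feasible = normal_range_check(56.43, 64.018, 216, lower_bound=0)
print(f"interval: ({low:.3f}, {high:.3f})")
print(f"expected in each tail: {expected:.1f}")
print("lower limit feasible" if feasible else "lower limit unfeasible: data likely non-normal")
```

An unfeasible limit is evidence against normality, but a feasible interval does not demonstrate normality; the sample values still need to be examined.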
It is not uncommon to find non-normal data summarised using the mean and standard deviation. Some statistical tests are invalid if the data samples are non-normal.
When reviewing publications that have data summarised as the mean and standard deviation, if the interval (mean ± 2 sd) has unfeasible limits then this may indicate that the statistical tests subsequently performed are invalid.
Ref: Rautanen T et al, Randomised double blind trial of hypotonic oral rehydration solutions with and without citrate, Archives of Disease in Childhood, 1994; 70, 44-46.
In the first table, age, duration of vomiting and weight loss have unreasonable lower limits for the interval (mean ± 2 sd). For instance, the average age of the citrate ORS group is 13.5 months with a standard deviation of 6.9 months. If the ages are normally distributed, then we expect about 95% of the patients to have been aged between 13.5 - 2(6.9) and 13.5 + 2(6.9), i.e. between -0.3 and 27.3 months. Clearly an age of -0.3 months is not possible. The implication is that these variables are upwardly skew and the chosen means of presentation and analysis are invalid.
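When only a published mean and standard deviation are available, the same check can be made directly from those two numbers. The sketch below uses the age figures quoted above. The function name implied_range and the default minimum of zero are assumptions made for illustration.

```python
def implied_range(mean, sd, k=2.0, minimum=0.0):
    """Return the (mean ± k sd) limits and whether the lower limit is unfeasible.

    'minimum' is the smallest value the measurement could take
    (assumed here to be zero, as for ages or durations).
    """
    lower = mean - k * sd
    upper = mean + k * sd
    return lower, upper, lower < minimum

# Age (months) in the citrate ORS group, as quoted in the text.
lower, upper, flagged = implied_range(13.5, 6.9)
print(f"implied range: {lower:.1f} to {upper:.1f} months")
print("lower limit unfeasible:", flagged)
```

A flagged result does not prove the data is skew, but it is a reason to doubt that the mean and standard deviation describe the sample well.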
In the second table, weight increase and the durations of vomiting, diarrhoea and stay also appear to be upwardly skew and inappropriately presented.
Upward skew distributions of values are far more common than downward skew. This is because there is often a lower limit (usually zero) below which values cannot fall. What was found in the above example fits in with a plausible clinical scenario. For instance, patients could not have a negative duration of stay, but it is likely that most stayed for a relatively short period while a few stayed much longer. These few form an upward tail to the distribution and heavily influence the mean.
Sometimes non-normally distributed data can be transformed to normality.
The transformations used should not change the relative ordering of the values but alter the distance between successively ordered values to change the overall shape of the distribution.
For example: If a dataset is transformed by squaring each of the values, the larger values will be pulled further apart than the smaller values.
- There is a difference of 1 between 2 and 3 prior to transformation; after squaring the measurements 2 becomes 4 and 3 becomes 9 and the difference between the transformed measures is 5 (= 9 - 4).
- There is a difference of 1 between 10 and 11 prior to transformation; after squaring the measurements 10 becomes 100 and 11 becomes 121 and the difference between the transformed measures is 21 (= 121 - 100).
After transformation the higher measurements (10 and 11) are further apart than the smaller (2 and 3).
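The arithmetic of this example can be reproduced with a short sketch. The helper name gaps_after is hypothetical. The values are the ones used above, plus a check that squaring leaves the ordering of positive values unchanged.

```python
def gaps_after(transform, pairs):
    """Difference between the transformed values of each (smaller, larger) pair."""
    return [transform(b) - transform(a) for a, b in pairs]

def square(x):
    return x ** 2

pairs = [(2, 3), (10, 11)]      # each pair is 1 apart before transformation
gaps = gaps_after(square, pairs)
print(gaps)                     # differences after squaring: [5, 21]

# Squaring positive values stretches the spacing but keeps the ordering.
values = [2, 3, 10, 11]
squared = [square(v) for v in values]
order_preserved = squared == sorted(squared)
print(order_preserved)
```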
Squaring data values can therefore be used to normalise downward skew data (by pulling apart the higher measurements an upward tail is created to balance the downward tail and hence give a more nearly normal distribution).
There are a variety of transformations that can be used to correct for skewing to a greater or lesser extent. The correct transformation to use will depend on both the direction and extent of skew. It is possible to over-correct by using too powerful a transformation and change the direction of the skew. For example, a small amount of downward skewing might be over-balanced by squaring the measurements and result in an upward skew distribution.
Tukey's ladder of transformations (shown below) gives several common transformations to correct skew in each direction and illustrates the relative effectiveness of these.
For example, the ladder shows that squaring corrects downward skew and that cubing the data gives an even stronger correction; i.e., if we cube rather than square the values then the right-hand (higher) values are pulled apart even more, creating a more extreme upper tail.
Upwardly skew data is not uncommon in medical applications and many measurements which display upward skewing are what is known as 'lognormally distributed'. When data is lognormally distributed, taking logarithms (or logs) of the data values will normalise the data.
Serum triglyceride values in cord-blood are lognormal:
Choosing a suitable transformation can be a matter of trial and error. Logging corrects upward skew; if data is downwardly skew then logging will make the skewness worse. Downward skew may be corrected (to varying extents) by squaring, cubing or anti-logging.
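The effect of logging on upwardly skew data can be illustrated with simulated values. The sketch below is an assumption-laden illustration: the data are generated from a lognormal distribution with arbitrary parameters (they are not triglyceride measurements), and skewness is measured with the simple moment coefficient, which is positive for upward skew, negative for downward skew and near zero for symmetric data.

```python
import math
import random

def skewness(values):
    """Moment coefficient of skewness: third central moment / (variance ** 1.5)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5

random.seed(1)  # fixed seed so the illustration is repeatable
# Simulated lognormal sample; the parameters 0 and 0.5 are arbitrary choices.
raw = [random.lognormvariate(0, 0.5) for _ in range(2000)]
logged = [math.log(v) for v in raw]

skew_raw = skewness(raw)
skew_log = skewness(logged)
print(f"skewness before logging: {skew_raw:.2f}")
print(f"skewness after logging:  {skew_log:.2f}")
```

Comparing the skewness before and after a candidate transformation is one way of making the trial and error described above a little more systematic. A histogram of the transformed values should still be examined.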
Often it is possible to use a transformation that has some biological basis. For example, taking square roots of areas or cube roots of volumes may be effective. Taking logarithms may not seem intuitive but this transformation is particularly useful when there are different groups to be transformed and compared. The particular properties of logarithmic transformation are illustrated later in this course.
All of these transformations change the magnitude of the data values, some more than others, to reduce the skew. Note that they never change the relative ORDER OF THE VALUES.
Some distributions show skewing so extreme that a large percentage of measurements are at one of the extremes.
For example, psychological tests often consist of a rating scale, 'normal' people being expected to score zero on the scale, higher scores indicating deviations from 'normal' behaviour or emotions. It is not uncommon to collect a sample which consists mostly of zeros or ones ('normal' people). The result is what is known as a J-shaped distribution.
The J-shape may be skewed to the left or the right depending on whether the majority of measurements are at the lower or upper extreme.
Examples of each type are shown below:
Ref: Thornton A, Morley C, Green S, Cole T, Walker K and Bonnett J, Field trials of the Baby Check score card: mothers scoring their babies at home, Archives of Disease in Childhood, 1991; 66: 106-110.
Figure: Profile of daily scores (n=701). The numbers of babies with each score are shown at the top of each column.
Most babies are completely healthy (implied by a score of 1) and this is shown by the majority of measurements at the lower extreme.
Ref: Tibrewal S and Foss M, Day care surgery for the correction of hallux valgus, Health Trends, 1991; Vol 23 No. 3: 117-119.
Figure: Linear analogue scores for pain relief after Wilson's Osteotomy.
Most individuals had 100% (full) pain relief after the operation.
Since transformations never change the order of the sample, any transformation of a J-shaped distribution will still be J-shaped. The extreme measurements will all transform to the same new value, and will always be at one extreme of the transformed sample.
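This can be checked numerically. The sketch below uses an invented set of scores in which 60% of the values are zero and applies several order-preserving transformations. Because the logarithm of zero is undefined, log(x + 1) is used here as an assumption rather than a plain log.

```python
import math
from collections import Counter

# Invented J-shaped scores: 12 of the 20 values sit at the lower extreme.
scores = [0] * 12 + [1] * 4 + [2] * 2 + [3, 5]

def share_at_minimum(values):
    """Proportion of the sample equal to its smallest value."""
    counts = Counter(values)
    return counts[min(values)] / len(values)

transformations = {
    "raw": lambda x: x,
    "square root": math.sqrt,
    "log(x + 1)": math.log1p,   # assumption: shift by 1 to cope with zeros
    "square": lambda x: x * x,
}

shares = {}
for name, f in transformations.items():
    transformed = [f(v) for v in scores]
    shares[name] = share_at_minimum(transformed)
    print(f"{name}: {shares[name]:.2f} of values at the lower extreme")
```

Whatever transformation is chosen, the same 60% of values remain tied at the lower extreme, so the spike at the end of the distribution cannot be removed.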
J-shaped distributions cannot be transformed to normality.