UCL Great Ormond Street Institute of Child Health



Chapter 2: Types, Storage, and Graphical Displays of Data

The idea of a research study is to generate data that should answer the research question being posed. Sometimes a lot of data is generated. It is important that the data collected is stored and used efficiently; it is unethical not to do so.

It may be tempting to dive straight into statistical analyses of data as soon as it is collected. However, there are 3 things which should be considered prior to such analyses and these are covered in this chapter. Depending on the form of the study and the nature of the data, these areas may be of great importance or of no relevance in a particular scenario. Each is discussed in turn and provides a backdrop for the future sections/subsequent analyses. The 3 things to be considered are:

  1. An introduction to the different types of data; it is important to establish the type of data collected in order to identify the correct means of summarising, displaying, and analysing that data.
  2. The storage of data, which must be considered at study onset as there may be major implications for the time needed and practicalities of performing the study. The transference of data from individual respondent to the final analyses may not be a trivial exercise. The protocol for dealing with individual pieces of data must be clearly established before the study has commenced.
  3. Finally, this section considers ways of initially viewing/displaying the collected data. It is important to 'get to know' the data before starting any analyses. Summarising the data as frequency distributions, tables and graphs will help to identify trends (which may or may not have been expected) that may inform the subsequent analyses. Outliers may be highlighted and/or distributional tendencies (e.g., skewness) which could invalidate or overly influence results may be identified at this stage.

Types of Data

A variable is something that varies between individuals or items.

Information is collected from the study groups by recording each individual's values for the relevant variables. For example:

In a study to assess the effect of giving recombinant human erythropoietin to premature infants,

  • haemoglobin level and
  • number of transfusions required

…may be relevant variables to record for infants in the treatment and placebo groups.

Variables can be classified according to their type. It is important to consider the variable type when deciding on how to display and analyse the study results. In this section, the types of variable are defined and discussed:

Variables are either categorical or numeric. A description of each is given below.

Categorical variables

Individuals are classified into one of several categories. For example:

Blood group, which is A, B, AB or O

Depending on the number of categories and whether there is an ordering to them, the variable is either binary, nominal, or ordinal:

Binary variables

If there are only two categories, then the variable is known as binary (or dichotomous). Binary variables are very common. For example:

  • Yes/no responses
  • Female/male
  • Low/normal birthweight

Ordinal variables

If there are more than two categories and the categories have an obvious order, then the variable is ordinal. For example:

  • Social class (1,2,3a,3b,4,5)
  • Pain (none/mild/moderate/severe)
  • The faces pain rating scale

Wong-Baker Faces Pain Rating Scale

With ordinal variables it is not clear that the difference between 1 and 2 is the same as between 2 and 3, or 3 and 4, etc. For example, for the faces pain rating scale, we know that 2 is less happy than 1 and 3 is less happy than 2 (and therefore also less happy than 1), but not that the degree to which their happiness differs is the same.

Nominal variables

Categorical variables which are neither binary nor ordinal are nominal. For example:

  • Ethnic group (Caucasian/Asian/Afro-Caribbean)
  • Marital status (married/ single/divorced/separated/widowed)

Numeric variables

A number describes each individual's value. For example:

  • Number of transfusions
  • Haemoglobin level

Numeric variables are either discrete or continuous, although the distinction is only important if the variable is discrete with relatively few values:

Discrete variables

If the numeric variable can take only a limited set of distinct values, usually whole numbers (0, 1, 2, 3, ...), then it is known as discrete. For example:

  • Age in years
  • Parity
  • Number of visits to clinic

Continuous variables

In theory, continuous variables can take any value within a certain range. In practice, the possible values the variable takes may be restricted by the accuracy of the recording device. For example:

  • 'Exact' age (usually meaning age to the nearest day or month)
  • Blood pressure
  • Head circumference

Changing the form

There is overlap between the different types of variable. For example:

  • There may be dispute about whether parity (which usually takes only the values 0, 1, 2, 3, 4, and 5) is discrete numeric or ordinal.

Data is sometimes rearranged to deliberately change it from one type to another. For example:

  • Blood pressure (continuous) is often re-classified as hypertensive/normotensive (binary)
  • Birthweight (continuous) may be divided into those less than 2500g and those 2500g or more to give low and not-low birthweight (binary)
  • Number of clinic visits (discrete) may become 0, 1-5, 6-10, 11+ (ordinal)

This rearrangement can happen at either the collection or the analysis stage.

It is NEVER advisable to reduce data at the collection stage. During analysis, keep variables in their original form unless there are strong medical reasons for not doing so.

The statistical analysis of continuous data is more powerful and often simpler.
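
The re-classifications listed above can be sketched in a few lines of Python. This is a minimal illustration rather than a prescribed method: the function names and example values are hypothetical, and the cut-offs are those quoted in the text (2500g for low birthweight; 0, 1-5, 6-10, 11+ for clinic visits). The original values are kept alongside the derived categories, in line with the advice not to reduce data at the collection stage.

```python
def birthweight_category(weight_g):
    """Classify a birthweight in grams as 'low' (< 2500g) or 'not low' (binary)."""
    return "low" if weight_g < 2500 else "not low"


def clinic_visit_band(n_visits):
    """Group a count of clinic visits into the ordinal bands 0, 1-5, 6-10, 11+."""
    if n_visits < 0:
        raise ValueError("number of visits cannot be negative")
    if n_visits == 0:
        return "0"
    if n_visits <= 5:
        return "1-5"
    if n_visits <= 10:
        return "6-10"
    return "11+"


# Hypothetical example values, invented purely for illustration.
birthweights_g = [2450, 2500, 3310]
visits = [0, 3, 6, 12]

# Keep the original value next to the derived category rather than replacing it.
birthweight_table = [(w, birthweight_category(w)) for w in birthweights_g]
visit_table = [(v, clinic_visit_band(v)) for v in visits]

print(birthweight_table)
print(visit_table)
```

Because the derived category is stored in addition to the measured value, the analysis can still use the continuous or discrete form if that turns out to be more appropriate.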

Data Storage

The process whereby data is collected from individuals, transferred to electronic format, and subsequently stored, requires consideration prior to data collection commencing. We need to ensure that the correct information will be available in the correct format for analyses. The importance of using an efficient and well thought out process should not be under-estimated. It is not uncommon for researchers to omit giving any serious thought to the final product until analysis is due to begin, perhaps in the final months of a project lasting several years. At this stage, rectifying badly designed systems may be very time consuming.

Each study is unique and the best way to collect and transfer data will vary according to local needs and resources. There are however some basic guidelines which can be used uniformly. These are:

1. If there is any doubt about the process of transferring collected data to electronic form, a few cases should be piloted to determine a realistic schedule/process.

2. Ensure there is a written protocol which details the process that will be used and how any deviations from this are to be recorded and/or dealt with.

3. Establish and adhere to local data protection requirements. This may involve removal of identifying features from individual patient data, the use of password protected databases and/or restrictions to off-site access.

4. Ensure that any database stores information in a format ready for the subsequent analyses that are envisaged. This will include:

(i) Storing most variables in 'numeric' as opposed to 'string' format

(ii) Having a suitable coding system for missing data (depending on the database used for storage, this may be some otherwise unused value - such as 999, or leaving blank cells in the database).

5. Use helpful variable names and codes. For example:

(i) The variable 'GENDER' coded 1-male, 2-female, is more usefully named 'MALE' and coded 1-male, 0-female (or 'FEMALE' coded 1-female, 0-male), as this will be more informative in the subsequent analyses. It also negates the need to keep having to remember which code (1 or 2) meant what.

(ii) It is also best to code any binary variable as 0 & 1 (rather than 1 & 2) as this will be better for any regression models that are built.

6. Where possible, build data checks into the entry process. For example: If a gestational age of 50 weeks is entered, will this be automatically flagged as a potential problem? If the database allows checks to be undertaken automatically then this is best. Otherwise, is it possible to have a series of checks run after each batch of data is entered (perhaps by looking at the distribution of gestational ages at regular time points)?

7. Double entry validation of all data should be employed where possible.

8. Ensure that any necessary transfer of data from a database to statistical analysis package will go smoothly. It is better to test this out with a few cases than to find out, near the project end, that there are problems. In particular:

(i) Are all numeric variables transferred in proper numeric format with all decimal places etc. intact?

(ii) Do missing values transfer properly?

(iii) Are variable labels transferred, or will these have to be re-entered?

(iv) Are there restrictions on variable or label names in the package being transferred to that might mean some information is lost? For example, the statistics package may only support variable names of up to 8 characters, or may not accept certain symbols. If the database contains longer variable names, some of which start with the same 8 characters, these will be indistinguishable in the package; unrecognised symbols usually result in a series of unhelpful names such as 'var1', 'var2' etc.

9. Check that the layout of the final dataset will be suitable for the analyses envisaged. The appropriate layout may be dependent on the statistical package to be used. For example, if blood pressures are to be compared between treatment and placebo groups using a formal statistical test, does the data need to be laid out:

(i) as one column denoting group (treatment/placebo) and another for blood pressure, or

(ii) as one column for blood pressures in the treatment group and another column for blood pressures in the placebo group

Note that scenario (i) has two equal sized columns; scenario (ii) may have columns of different lengths.

If serial measurements are made of the same variable for each individual, should these be stored as:

(i) one column for each measurement time, or

(ii) one row for each measurement time within each individual

In this case, for scenario (i) the dataset has one row of data per individual, whereas scenario (ii) has much longer columns (but fewer of them) and multiple rows per individual.

10. There should be a designated database manager who is responsible for making sure that the data is cleaned and stored in a suitable and accessible format.
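
Several of the guidelines above (0/1 coding of binary variables, a missing-data code, automatic range checks, and the choice between one row per individual and one row per measurement) can be illustrated together. The sketch below is an example under stated assumptions, not a recommended system: the field names, the 999 missing code, the 22 to 44 week plausibility range and the records themselves are all hypothetical.

```python
MISSING_CODE = 999          # assumed missing-data code, as suggested in guideline 4
GEST_AGE_RANGE = (22, 44)   # assumed plausible range in weeks, for illustration only

# Hypothetical raw records in 'wide' layout: one row per individual,
# one column per measurement time (bp_week0, bp_week4).
raw = [
    {"id": 1, "gender": 1, "gest_age": 39, "bp_week0": 62, "bp_week4": 65},
    {"id": 2, "gender": 2, "gest_age": 50, "bp_week0": 58, "bp_week4": 999},
]


def clean(record):
    """Return a copy with 'male' coded 0/1 and missing codes replaced by None."""
    # GENDER coded 1-male, 2-female becomes MALE coded 1-male, 0-female.
    out = {"id": record["id"], "male": {1: 1, 2: 0}.get(record["gender"])}
    for key in ("gest_age", "bp_week0", "bp_week4"):
        value = record[key]
        out[key] = None if value == MISSING_CODE else value
    return out


def range_flags(records):
    """List the ids whose gestational age falls outside the plausible range."""
    low, high = GEST_AGE_RANGE
    return [r["id"] for r in records
            if r["gest_age"] is not None and not low <= r["gest_age"] <= high]


def wide_to_long(records):
    """Reshape to one row per measurement time within each individual."""
    rows = []
    for r in records:
        for week in (0, 4):
            rows.append({"id": r["id"], "week": week, "bp": r[f"bp_week{week}"]})
    return rows


cleaned = [clean(r) for r in raw]
flagged = range_flags(cleaned)     # ids to be queried, e.g. the 50-week entry
long_rows = wide_to_long(cleaned)

print(cleaned)
print(flagged)
print(long_rows)
```

Whether the wide or the long layout is needed depends on the statistical package, so it is worth trying a conversion like this on a few pilot cases before the main data entry begins.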

Graphical Displays of Data

The presentation of data is an extremely important part of any study. The purpose of graphical displays and tables is to impart information to the reader in a more easily digestible form than the raw data. The best method of presentation of the data will depend on:

  1. The number of observations in the sample(s)
  2. The number and type of variables to be displayed

Ways of displaying data are presented within this section. Descriptions and discussion of the most commonly used modes of display are given. The first section deals with displays for a single variable. Next the joint display of pairs of variables and the relationship between them is covered. Finally there is the discussion of the co-presentation of more than two variables.

Emphasis throughout is on imparting the relevant information to the reader in an understandable fashion without losing any important features of the data.

These notes are meant as a guide and to give ideas rather than to be completely prescriptive. There are no hard and fast rules; each dataset and situation must be individually considered along with the proposed audience.

Graphical Displays: One Variable

The most common displays for the values collected on a single variable are considered. The final section discusses the relative merits of these displays.

ONE VARIABLE - Frequency distributions and bar charts

The frequency distribution shows the frequency with which the values of a variable are distributed amongst the individuals.

Here is an example of a categorical variable assessed in a group of individuals and the results displayed in this way:

Ref: Frame S, Moore J, Peters A & Hall D, Maternal height and shoe size as predictors of pelvic disproportion: an assessment, British Journal of Obstetrics and Gynaecology, Dec 1985, Vol. 92, 1239-1245.

The variable being displayed here is the mode of delivery at birth ('type of delivery'):


Bar charts are commonly used to display frequency distributions. The heights of the bars show the frequency with which observations fall within certain intervals. Here is the same data shown as a bar chart:


None of the relevant information is lost with either of these displays. If the information is given in either of these formats, as opposed to giving the list of categorisations obtained in their raw form, then the only thing that cannot be determined is the order in which the 351 patients presented. In this instance, the order in which the women presented is probably unimportant so that this loss is not of concern.

The bar chart possibly makes the information given more easily digestible in that it illustrates the relative frequency of the birth types graphically. Therefore, the bar chart may be preferred to the frequency distribution as a means of displaying this data.
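
As a rough illustration of how a frequency distribution and a simple bar chart are derived from a raw list of categorisations, consider the sketch below. The category names and values are hypothetical (the study's actual data are not reproduced here), and the text-based bars are only a stand-in for a properly drawn chart. Categories are printed from most to least frequent.

```python
from collections import Counter

# Hypothetical raw categorisations, invented purely for illustration.
deliveries = ["normal", "forceps", "normal", "caesarean", "normal",
              "forceps", "normal", "caesarean", "normal", "normal"]

# Frequency distribution: how often each category occurs.
frequencies = Counter(deliveries)

# Text bar chart, with categories ordered from most to least frequent.
for category, count in frequencies.most_common():
    percent = 100 * count / len(deliveries)
    print(f"{category:<10} {'#' * count} {count} ({percent:.0f}%)")
```

The only information discarded in going from the raw list to the frequencies is the order in which the individuals were recorded.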

The bar chart can sometimes be made more effective as a means of display by ordering the categories according to frequency. For example:

Ref: O'Connor M, Europe and nutrition: prospects for public health, BMJ 1992; 304, 178-80.


ONE VARIABLE - Pie charts

A pie diagram shows the proportion of a sample falling into each of the categories of a categorical variable. The proportions are shown relative to one another by dividing a circle accordingly and naming the segments. For example, in the first pie diagram below, 46% of the 200 patients diagnosed with HIV presented initially with pulmonary TB, therefore they are depicted by a segment encompassing almost half the circle.

(1) Ref: Farmer P et al. Community-based treatment of advanced HIV disease: introducing DOT-HAART (directly observed therapy with highly active antiretroviral therapy). Bulletin of the World Health Organisation, 2001, 79(12) 1145-1151.


Presenting diagnoses in 200 patients with HIV disease, Clinique Bon Sauveur, Haiti, 1993-1995

(2) Ref: Platt and Pharoah, Child health statistical review; 1996; Archives of Disease in Childhood, 1996, 75, 527-533.


NOTE: Pie diagrams are useful only for displaying data collected on NOMINAL variables. If there were an order to the categories then it would effectively be lost with this type of display. 
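
The size of each segment follows directly from the proportion it represents. The short sketch below works through the pulmonary TB figure quoted earlier (46% of 200 patients); the function name is hypothetical and the calculation is simply the proportion multiplied by 360 degrees.

```python
def segment_angle(proportion):
    """Angle in degrees of the pie segment for a category with the given proportion."""
    return 360 * proportion


n_patients = 200
pulmonary_tb_share = 0.46   # from the text: 46% presented with pulmonary TB

count = round(n_patients * pulmonary_tb_share)   # number of patients in the segment
angle = segment_angle(pulmonary_tb_share)        # a little under 180 degrees

print(count, round(angle, 1))
```

An angle just under 180 degrees corresponds to the segment "encompassing almost half the circle" described above.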

ONE VARIABLE - histogram

For numeric variables the frequency distribution of the individual values is not very helpful. It is more useful to categorise the values and to report the frequencies for each category. For example, the heights of 346 of the 351 women in the previous example were also recorded. If the heights are separated into ordinal categories, then these can be displayed in the following frequency distribution:


*In 5 cases height was not recorded

Using these frequencies, we can create a graphical summary similar to a bar chart. However, this graph is known as a histogram:


ONE VARIABLE - Bar charts vs. Histograms

Bar charts are usually used to show the frequencies of a categorical variable. Therefore, the categories in a bar chart can often be reordered, separated or grouped together. The height of each bar shows the frequency within each category.

Histograms are used to show the distribution of a numeric variable. Histograms usually separate the variable into ordinal categories. These categories are often called bins.

Histograms usually separate the numeric data into categories with an equal range (just like in the histogram above where each height category has a range of 3cm). However, this is not always the case, as we will see in the following example.

The ages of 85 paediatric patients presenting for cardiac operations are given in the table below, both as the number per category and as the number per month within category:


Age in the first column is grouped 0-1 month, 2-6 months, 6-12 months, 1-2 years, 3-5 years, 5-15 years, and 16+ years.

The number per month within category shows a pattern of declining frequency with age which is harder to see within the number per category display.  A bar chart would consider the number per category with each category being given equal width in the display:


In forming a histogram the width of the bar is changed in proportion to the range of the continuous variable being covered by that particular bar. It is now the area of the bar which shows the frequency with which observations fall within certain intervals. 

A histogram of the data considers the frequency per month. In effect this gives the bars their representative distance on the horizontal axis (i.e. the bar for the 10 babies in the 0-1 month category, which covers only one month, is stretched upwards to be tall and thin; the bar for the 9 patients in the 15-20 years category, which spans 5 years, is stretched out to be 60 times wider (5 years x 12 months per year = 60 months), making it short and wide):


The bar chart masks the true nature of the distribution. The histogram clearly shows that the majority of operations occurred in children of less than one year of age; there were few in those over two years. This pattern is less readily identified from either the frequency table or the bar chart.
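
The relationship between bar width, bar height and frequency can be checked with a small calculation. The sketch below uses only the two categories whose figures are quoted in the text (10 babies in a 1-month category and 9 patients in a 5-year category); the function name is hypothetical. The height of each bar is the frequency per month, so that the area of the bar (height multiplied by width) equals the frequency.

```python
# (label, width in months, number of patients) for the two categories
# whose figures are quoted in the text.
categories = [
    ("0-1 month", 1, 10),
    ("15-20 years", 5 * 12, 9),
]


def bar_height(count, width_months):
    """Histogram bar height: frequency per month, so that area equals frequency."""
    return count / width_months


for label, width, count in categories:
    height = bar_height(count, width)
    area = height * width            # recovers the original frequency
    print(f"{label:<12} width={width:>2} months  "
          f"height={height:.2f} per month  area={area:.0f}")
```

Although the two categories contain similar numbers of patients, their heights differ by a factor of more than 60, which is why the histogram reveals a pattern that the equal-width bar chart hides.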

ONE VARIABLE - Dot diagrams

A useful and under-used display of continuous numeric data is the dot-diagram. One point represents the value for each individual or item, so no information is lost (apart from the order in which the measurements were collected) and outliers are clearly highlighted. For example, below are the times taken for a scorpion to capture its prey (a cricket) in a laboratory experiment.

scorpion dot plot

Ref: Edmunds MC, Sibly RM. Optimal sting use in the feeding behavior of the scorpion Hadrurus spadix. Journal of Arachnology. 2010 Apr;38(1):123-5.

Capture time. Jitter has been applied to x coordinates to make overlapping points visible. 

Summary of displays for one variable

The previous 5 types of display (frequency distribution, bar-chart, histogram, dot-diagram and pie-chart) are useful when only one variable is being considered.

The type of variable may influence the appropriate means of display:

*There is no point in using a graphical display of data if it does not impart information in a more easily digestible form than the frequency distributions. It is therefore not recommended that binary data is displayed as a bar-chart or pie-chart and this may also apply to ordinal and nominal data with few categories.

Ref: Platt and Pharoah, Child health statistical review; 1996; Archives of Disease in Childhood, 1996, 75, 527-533.

The text states that "In the 927 patients studied, 649 were male (70.1%) and 278 female (29.9%)". The bar chart does not give any more information than the sentence in the text which references it. It is very wasteful of space; there must be more important aspects to this study than the excess of males.

Graphical Displays: Two Variables

We may wish to display two variables and their associations for a group of individuals. For example, considering whether there are differences in a particular variable between two groups (diseased/disease-free or treated/placebo). The method of display depends on the nature of the two variables.

TWO VARIABLES - Two numeric variables

If we are interested in displaying the levels of two numeric variables and their association within individuals then a scatterplot of the data should be produced unless the size of the sample is so large, with potential duplicates, that this display is unwieldy. 

The values of both variables for each individual are represented by a point on the plot. The individual values can be read from the plot and an idea of the relationship between the variables across individuals is obtained. Even if the plot is not used in the final presentation, it may highlight outliers and will help to indicate the appropriate form of analysis to use. 

For example, the plot below comes from a study from the BMJ, looking at the association between age and ear length. They found a weak positive association, meaning higher values of 'age' are associated with higher values of 'ear length' (the article was published in the Christmas edition, where more light-hearted articles are encouraged).

Ref: Heathcote JA. Why do old men have big ears? BMJ. 1995 Dec 23;311(7021):1668.

Scatter Plot Age vs. Ear Size

Sometimes the same variable is measured twice on the same individual under different circumstances, or at the same time in different eyes/arms/kidneys etc. For example:

(1) The change in blood pressure may be recorded whilst using a new treatment and again whilst using the standard treatment.

(2) Pupil dilatation may be assessed in each eye after different drops have been placed in the left and right eyes. 

Alternatively, the same variable may be measured in a matched pair of individuals. For example: 

(1) Reading ability at age 5 assessed for a child born at less than 30 weeks gestation and a term control of the same sex and social class in the same school class. 

(2) Heights may be compared between children with cystic fibrosis and age and sex matched children without disease.

Any 'pairing' that is inherent because of the way in which the data was collected should be retained in both displaying and analysing the data. 

Line diagrams are often used to display paired data, although a scatterplot is generally more appropriate. For example:

Ref: Milliner et al, Results of long-term treatment with orthophosphate and pyridoxine in patients with primary hyperoxaluria, New England Journal of Medicine,1994;Vol 331,No.23,1553-8.

Measurements were made of calcium oxalate inhibition in 12 patients, pre and post treatment. The authors displayed the data as a line-plot. The values for each patient are shown pre and post treatment and are joined by lines to show the within person pairing of the measurements. 

Calcium Oxalate Inhibitor Line Plot

Inhibition of the formation of calcium oxalate crystals during treatment with orthophosphate and pyridoxine in 12 patients with primary hyperoxaluria.

The line plot shows that all individuals have values that rise during treatment. One individual shows a very large increase from about 25 pre-treatment to about 145 during treatment (as illustrated by the steeply rising diagonal line).

The same data is presented below as a scatterplot:

Calcium Oxalate Inhibitor Scatter Plot

The line of equality (no change in values pre to during treatment) is shown as a dashed line on the display. All points lie above the line of equality, showing that values rose for each individual.

Whilst the same information is given by the two displays, the scatterplot uses only one point to represent each individual compared to 2 points and a line for the line diagram. The line diagram may be confusing to assess if there are changes in various directions; the scatterplot (with the line of equality superimposed if necessary) is easier to interpret.
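
Reading a paired scatterplot against the line of equality amounts to comparing the two values within each pair. The sketch below assumes the pre-treatment value is plotted on the horizontal axis and the during-treatment value on the vertical axis. The data are hypothetical: only the first pair reflects values mentioned in the text (about 25 and about 145), and the function name is invented for illustration.

```python
# Hypothetical (pre-treatment, during-treatment) pairs, one per patient.
# Only the first pair reflects values mentioned in the text (about 25 and 145).
pairs = [(25, 145), (40, 62), (55, 71), (30, 48)]


def side_of_equality(pre, during):
    """Position of the point (pre, during) relative to the line during = pre."""
    if during > pre:
        return "above"      # value rose
    if during < pre:
        return "below"      # value fell
    return "on"             # no change


positions = [side_of_equality(pre, during) for pre, during in pairs]
changes = [during - pre for pre, during in pairs]

print(positions)
print(changes)
```

Keeping each pair together in this way retains the within-person pairing, which is the feature of the data that both the display and the analysis need to respect.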

TWO VARIABLES - One numeric and one categorical

Dot-diagrams can be used. No information is lost, the display clearly shows the relationship between the variables and also highlights possible outliers. 

We saw an example earlier of the times it takes for a scorpion to capture its prey presented as a dot plot. This example can be extended to include a categoric variable (low/medium/high prey activity):

(1) Ref: Edmunds MC, Sibly RM. Optimal sting use in the feeding behavior of the scorpion Hadrurus spadix. Journal of Arachnology. 2010 Apr;38(1):123-5.

Scorpion Dot Plot with Levels of Prey Activity

Dot plots can be used to look at whether values in one group are typically different from values in another group. In the example above, the plot shows that it typically takes slightly longer for a scorpion to catch prey with low activity than prey with high activity.

Dot plots can also be used to look at the differences between the distributions of groups. In the example below, E coli specific SIgA values are typically lower and also less spread out in the 'White UK' category.

(2) Ref: Nathavitharana KA, Catty D and McNeish AS, IgA antibodies in human milk: epidemiological markers of previous infections? Archives of Disease in Childhood,1994;71,F192-7.

Dot Plot Antibodies in Human Milk

Figure: Percentage of diarrhoeagenic Escherichia coli O antigen specific SIgA antibody levels as a fraction of the total SIgA in the three groups. Horizontal bars denote medians for each group. Sri Lankan and Asian immigrant women had significantly higher values (p<0.0001 and <0.001) than their white controls in the UK.

TWO VARIABLES - Two categorical

Where two categorical variables and their relationship are to be shown, a contingency table of the data can be created. For example:

(1) Ref: Thornton AJ, Morley CJ, Green SJ, Cole TJ, Walker KA, Bonnett JM, Field trials of the Baby Check score card: mothers scoring their babies at home, Archives of Disease in Childhood, 1991; 66: 106-110.

Mother's social class table

The table below shows how social class varied between the two areas of the baby check scoring system. In both areas the mothers were mostly from social class III manual.

(2) Ref: Morley CJ, Thornton AJ, Green SJ, Cole TJ, Field trials of the Baby Check score card in general practice, Archives of Disease in Childhood, 1991, 66: 111-114.

This table shows how illness severity was related to baby-check score. We can see the association of increasing severity with increasing score.

Baby Initial Illness Severity Scale

The initial impression was not recorded for two babies.

(3) Ref: Ahmed et al. Neonatal morbidity and care-seeking behaviour in rural Bangladesh. Journal of Tropical Pediatrics, 2001, 47, 98-105.

Amongst other things, the first table below shows how medically unqualified practitioners were used most often for all recorded forms of morbidity and for more than one in three skin rashes no care was sought. In the second table we see that care from the district hospital appears to be the most expensive option, followed by private practitioners and village doctors.

Neonatal Morbidity Tables

Three dimensional bar-charts can be used to show the numbers in each section of the table. However, whilst these may look quite impressive, they do not generally make interpretation any simpler and may even 'lose the numbers'. For example:

Ref: Cardia et al, Outcome of craniocerebral trauma in infants and children, Childs Nerv. Syst., 1990; 6: 23-26.

3D bar chart

In the first group (0-5), the difference in number between males and females was only slight (22% males 18% females) or a ratio of 1.2: in the other groups: however, the difference was greater (6-10 years: 20.7% m:6.6% f: ratio m/f=3.3: from 11 to 16 years 26/4% m: 6% f: ratio m/f=4.3).

The information shown above (gender and age group) could be given in a 2x3 table (two rows: one for males, one for females; three columns: 0-5, 6-10 and 11-16 years of age) and the number of individuals of each sex within each age group would be given in the six cells of the table. The three dimensional bar-chart replaces each of the six numbers with a bar of the appropriate height; however, because of the three dimensional aspect of the display it is not possible to read off the original numbers. For instance, the legend states that within the 0-5 age group there were 22% males and 18% females; we can see from the bar-chart that there are more males than females, but we cannot read off these exact values. The display is used to impart only 6 figures, and it has lost those!
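
A 2x3 table of this kind is straightforward to build from individual records. The sketch below uses hypothetical records (not the study's data) and simply counts how many individuals fall into each combination of sex and age group, then prints the six counts in a two-row, three-column layout.

```python
from collections import Counter

# Hypothetical individual records (sex, age group); not the study's data.
records = [
    ("male", "0-5"), ("female", "0-5"), ("male", "6-10"),
    ("male", "11-16"), ("female", "6-10"), ("male", "0-5"),
    ("male", "11-16"), ("female", "0-5"),
]

sexes = ["male", "female"]
age_groups = ["0-5", "6-10", "11-16"]

# Count how many individuals fall into each (sex, age group) cell.
cells = Counter(records)

# Print the 2x3 table with the exact count in every cell.
print(f"{'':<8}" + "".join(f"{g:>7}" for g in age_groups))
for sex in sexes:
    print(f"{sex:<8}" + "".join(f"{cells[(sex, g)]:>7}" for g in age_groups))
```

Unlike the three dimensional bar-chart, the table keeps every one of the six numbers readable.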

Air Pollution 3D Bar Chart

It appears that for most of the years ozone was the major component of air quality standard. In 1955 sulphur dioxide was the main feature. It is not possible to read off the actual figures. This data could have been shown as a 7x5 table.

These displays may look impressive, but they are not generally an effective way of imparting the information with minimal loss of relevant information.

Side-by-side or stacked bar charts may be an effective way of presenting data on two categorical variables.

The following three examples of side by side bar charts show the data more effectively than the corresponding 5x4, 3x3 and 2x7 contingency tables would.

(1) Ref: Farmer P et al. Community-based treatment of advanced HIV disease: introducing DOT-HAART (directly observed therapy with highly active antiretroviral therapy). Bulletin of the World Health Organisation, 2001, 79(12), 1145-1151.

Antiretrovirus Side-by-side Bar Chart

(2) Ref: Ahmed et al. Neonatal morbidity and care-seeking behaviour in rural Bangladesh. Journal of Tropical Pediatrics, 2001, 47, 98-105.

Neonatal Morbidity Side-by-side Bar Chart

The following stacked bar chart is a very effective means of illustrating the fall in neural tube pregnancies over the years 1968-1990 together with the increasing terminations, probably reflecting earlier detections in later years.

Ref: Hey et al, Use of local neural tube defect registers to interpret national trends, Archives of Disease in Childhood,1994;71,F198-202.

Neural Tube Defects Stacked Bar Chart

Occasionally a picture may help the presentation. For example, two categorical variables, wound location and pathology (neuropathic/Charcot-related ulceration), are shown by these feet:

Feet Wounds Picture

Figure 1: Wound location by pathology group. (a) Distribution of ulcers as a percentage of the total in the patients with neuropathic foot ulceration. (b) Same data for the group with Charcot-related ulceration.

This display is unusual but probably more effective than the corresponding 2x9 table of pathology x location in illustrating the data. A side by side bar-chart could also have been used. 

A second example below shows the population density of all London boroughs in 1801, which is best presented as a map rather than a graph, but the colour gradient only gives a rough indication of the population density so some of the data is lost. The exact numbers could have been written as text.

Ref: Cheshire, James. Spatial Analysis.co.uk. February 16, 2011. http://spatialanalysis.co.uk/2011/02/mappinglondons-population-change-20... (accessed August 2017).

London Population Density

TWO VARIABLES - Summary of displays for two variables

Types of variables              Suitable methods of display
Two numeric                     Scatterplot
One numeric, one categorical    Dot plots
Two categorical                 Contingency table; side-by-side bar chart; stacked bar chart

Graphical Displays: More Than Two Variables

There are several ways in which the displays shown for two variables can be extended to incorporate information for additional variables. 

Scatterplots incorporating different symbols

Scatterplots can incorporate different symbols to display an extra, categorical, variable. For example, the following scatterplot shows that FRC (lung function) tends to increase with increasing height and tends to be lower at any given height amongst children born preterm.

(1) Ref: Thompson PJ and Greenough A, Hyperinflation in premature infants at preschool age, Acta Paediatrica., 1992; 81: 307-10.

FRC preterm scatter plot

(2) Ref: Milliner et al, Results of long-term treatment with orthophosphate and pyridoxine in patients with primary hyperoxaluria, New England Journal of Medicine, 1994; Vol 331, No.23, 1553-8.

Pretreatment and Treatment Urinary Oxalate Scatter Plot

Figure 1. Urinary Oxalate Excretion before and during the First 2 to 14 Months of Treatment with Orthophosphate and Pyridoxine in 21 patients with Type (i), Type (ii), or an Undetermined Type (iii) of Primary Hyperoxaluria.

A categoric variable can be incorporated into a dot plot in the same way. Here is the dot plot we saw earlier in this chapter, presenting the time for a scorpion to catch its prey:

Ref: Edmunds MC, Sibly RM. Optimal sting use in the feeding behavior of the scorpion Hadrurus spadix. Journal of Arachnology. 2010 Apr;38(1):123-5.

Scorpion Capture Time Dot Plot

Figure 2: Capture time and sting use in relation to prey activity. Prey activity was manipulated to be low, medium or high by keeping prey in a refrigerator for 15, 10 or 5 mins, respectively. • = sting used, o = sting not used. Jitter has been applied to x coordinates to make overlapping points visible.

Contingency tables for more than two variables

Contingency tables can be used to display three or more variables and their associations. For example, the following table shows the relationship between illness severity and baby check score and how this differs between those seen at home and those seen in hospital.

E.g. Ref: Morley CJ, Thornton AJ, Cole TJ, Hewson PH, Fowler MA, Baby Check: a scoring system to grade the severity of acute systemic illness in babies under 6 months old, Archives of Disease in Childhood, 1991; 66: 100-106.

Baby Illness Table

It is not usually very helpful to include more than three variables in one table since interpretation becomes very difficult.

Side-by-side and stacked bar charts

Bar charts can also be extended by combining stacked with side-by-side, although this may be confusing. For example, in the following chart, combining side-by-side with stacking means that a lot of information and relationships between variables are given, but it is difficult to pick out important features.

E.g. Ref: Mendelsohn D, Levin HS, Bruce D et al, Late MRI after head injury in children: relationship to clinical features and outcome, Child's Nerv Syst, 1992; 8:445-452.

Lesion Location Stacked and Side-by-side Bar Chart