UCL Great Ormond Street Institute of Child Health

Chapter 10: Beyond T-tests

Previous chapters have considered analyses for either one or two variables that include either paired or unpaired groups/measurements:

Summary Table

Any of the above scenarios can be extended to incorporate additional groups or measurements: 

  • More than two independent groups may be studied 

E.g., comparing treatment efficacy for unmatched/independent samples (mild, moderate, and severe disease, or new drug, current standard drug, and placebo).

  • More than two matched groups may be studied 

E.g., comparing a new treatment, an existing treatment, and a placebo, in groups that are individually matched for age, sex, height, weight, and disease severity.

  • A single group may be measured at more than two time-points 

e.g. comparison of measurements at baseline, post-intervention, and follow-up(s)

However… When more than two groups or related measurements need to be compared, it is WRONG to analyse every combination of two groups/conditions from the total number using separate 2-sample tests (e.g., t-tests or chi squared tests). Similarly, when you are interested in the associations between more than two variables you SHOULD NOT compare all possible combinations of two variables using multiple correlations.

Multiple Testing

Performing lots of separate tests on multiple groups, multiple variables, or repeated measurements is statistically unsound, and in most cases this renders the p-values for each test invalid, unless they are adjusted for multiple testing. 

The p-values yielded from statistical tests tell us about the probability of making an error in our conclusions. For instance, a p-value of 0.045 means that 4.5 times in 100, or 45 times in 1000 repeats of the study, we would expect to get the observed sample difference or association (or larger) when the null hypothesis is actually true. So, 4.5 times in 100 (4.5% of the time) we would identify a test result as significant when there is actually no genuine difference/association in the population - this is known as a type I error. 

Therefore, the cut-off value we select to determine when something is statistically significant (α) should reflect the rate or likelihood of error that we are willing to accept for any given test: a cut-off of 0.05 means we accept a type I error rate of 5%; to be more conservative the cut-off could be 0.01, which would reduce the type I error rate to 1%. 

This error rate applies to every statistical test that is carried out, so conducting multiple tests and comparisons directly increases the overall chance of making a type I error (i.e., finding something spuriously significant) within the group of tests carried out. The chance increases as the number of tests increases.

With a 5% chance of error, one in every 20 tests could be spuriously significant - if you test enough things you're almost bound to find something significant, but it may not be clinically/practically/theoretically useful or meaningful.

The best approach is to avoid multiple testing completely by making fewer comparisons or using more sophisticated statistics that allow simultaneous comparison of multiple groups/variables. However, if multiple tests are necessary, then a more conservative approach must be taken to the significance of each test by 'correcting' or 'adjusting' for multiple testing (the threshold for significance is lowered for each individual test). 

Correcting for Multiple Testing

There are many different methods for correcting for multiple testing that differ in the specific way and extent to which they adjust the threshold for significance, but one of the most commonly known and cited methods is the 'Bonferroni Correction'. 

The Bonferroni correction divides the overall α (the type I error rate) by m (number of tests conducted); individual tests are then only deemed significant if the p-value is <α/m.

E.g., if we conduct a group of 10 tests, roughly speaking we have about* a 50% chance of finding something significant among the tests if each is deemed significant at p<0.05. 

*The precise risk is 1 minus the likelihood of no error (0.95) raised to the power of the number of tests you want to do, i.e., 1 - (0.95^10) = 0.401 or 40.1%.

This error rate is too high to accept in any research scenario, so we take the original cut-off (0.05) and divide it by 10 (the number of tests), and apply the adjusted cut-off to each individual test. In this case, any of the tests would have to be significant at p<0.005 to be considered significant. The combined error rate will then be 1 - (0.995^10) = 1 - 0.951 = 0.049, or approximately 5% as required.
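The arithmetic above can be checked directly. The short Python sketch below (illustrative only, using the numbers from this example) computes the family-wise error rate for a group of independent tests and the Bonferroni-adjusted threshold:

```python
# Sketch: family-wise error rate for m independent tests at significance
# level alpha, and the Bonferroni-adjusted per-test threshold.

def familywise_error_rate(alpha, m):
    """Chance of at least one type I error across m independent tests."""
    return 1 - (1 - alpha) ** m

def bonferroni_threshold(alpha, m):
    """Per-test cut-off that keeps the overall error rate near alpha."""
    return alpha / m

# 10 tests at the usual 0.05 cut-off: ~40% chance of a spurious result.
print(round(familywise_error_rate(0.05, 10), 3))   # 0.401

# After Bonferroni adjustment (cut-off 0.005 per test), the overall
# error rate falls back to roughly 5%.
print(round(familywise_error_rate(0.005, 10), 3))  # 0.049
```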

Whilst this is a statistically valid approach to multiple testing, it is worth considering the practicality and applicability of results from multiple tests. Even in the small number of cases where multiple testing is valid, this approach to the comparison of groups and variables is very unlikely to answer any useful research question.

e.g. Oostdijk W, Hummelink R, Odink RJH, Partsch CJ, Drop SLS, Lorenzen F, Sippell WG, Van der Velde EA, and Schultheiss H. Treatment of children with central precocious puberty by a slow-release gonadotropin-releasing hormone agonist. Eur. J. Pediatr., 1990; 149: 308-313.

Researchers wanted to assess the efficacy of an intervention (hormone treatment) on a particular outcome (predicted adult height) in a specific population (girls with central precocious puberty). To investigate this question they compared a group of individuals who received the intervention to another group of similar individuals who didn't receive the intervention. The study compared outcome scores for each group at multiple time-points (baseline, 6 months, 12 months, and 18 months), within each subgroup and for the total combined group. This yielded significance tests between multiple groups (intervention, control, combined) at multiple times. Significant differences are shown in the figure below:

Oostdijk et al (1990) precocious puberty

This data could be used to answer a useful research question such as:

How does hormone treatment affect predicted adult height over 18 months in girls with central precocious puberty?

However, the graph above suggests that lots of separate tests were carried out to compare all of the time-points to 18 months post-treatment in group 1 and the combined group (suggested by the lines and asterisks above the bars). This approach to analysis answers a question more like the example below:

Is there a difference in predicted adult height between start of treatment and 18 months after treatment, between 6 months and 18 months after treatment, and/or between 12 months and 18 months after treatment for all girls and/or for girls in group 1?

The interpretation of the results of multiple significance tests will never be easy and can be open to subjectivity. Instead of carrying out several independent tests and trying to summarise the multiple results, there are alternative methods (e.g., ANOVA, Regression Analysis) that allow you to consider all groups, measurements, and variables that are relevant to the research question, at the same time. These methods of analysis are often called 'models' rather than 'tests' because of the way multiple variables can be concurrently examined in relation to (or, to explain) variation in a given outcome.

The most suitable method to use will depend, in part, on the number and types of variables that need to be analysed, and in part, on personal choice (there is a certain amount of crossover between ANOVA and Regression).

The following sections will describe what ANOVA and Regression tests/models are, and in what contexts they can and should be used.

ANOVA

Analysis of Variance, or ANOVA, is a method for assessing how numeric outcomes differ between groups. In its simplest form, ANOVA represents an extension of the two-sample t-test; both ANOVA and two-sample t-tests are used to analyse the relationship between independent variables (also known as predictors, factors, or explanatory variables) and dependent variables (outcomes), where the independent variable is categorical, and the dependent variable is numeric. However, while a two-sample t-test can only ever compare two groups/categories at a time, ANOVA is used to compare between 3 or more groups/categories at the same time. 

E.g., if we compare heights between two groups (e.g., treated and untreated) using a t-test, the independent variable is the treatment group, and the dependent variable is height. If we extended the same study to include three groups (e.g., treated, untreated, placebo), treatment group is still the independent variable and height is still the dependent variable, but we would use an ANOVA to compare all three groups at the same time.

There are several different types of ANOVA model, which are distinguished by the number of independent and dependent variables included in a single model/test, and whether or not there is pairing in the data. The most basic model is the one-way ANOVA.

ANOVA: One-way ANOVA

The one-way ANOVA is the simplest type of ANOVA model, and the most similar to a t-test. One-way ANOVA is used to compare a single numeric outcome between three or more groups. Thinking back to the two-sample t-test: the null hypothesis refers to there being no difference between the two groups being compared (e.g., there will be no difference in the mean heights of men and women), and hence no difference between the populations from which the groups were sampled. The null hypothesis being tested in a one-way ANOVA states that all (rather than just both) of the samples are randomly selected from populations that have the same mean; or, more simply, that there is no difference (in means) between any of the groups. 

Although several groups are compared simultaneously, an ANOVA model only produces a single p-value. The p-value tells you the probability of obtaining samples whose means differ at least as much as they do in the observed data, if the null hypothesis is true. 
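As an illustration, a one-way ANOVA can be run in Python with scipy.stats.f_oneway; the height values below are hypothetical, invented purely to show the mechanics:

```python
# Sketch: a one-way ANOVA on three hypothetical height samples. The single
# p-value tests the null hypothesis that all three groups were sampled
# from populations with the same mean.
from scipy import stats

treated   = [165, 170, 168, 172, 169, 171]
untreated = [160, 158, 162, 161, 159, 163]
placebo   = [161, 159, 160, 163, 162, 158]

f_stat, p_value = stats.f_oneway(treated, untreated, placebo)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
```

A significant result here would only say that at least one group mean differs; identifying which groups differ requires the follow-up steps described below.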

Even if populations did have the same means, we wouldn't expect the random samples taken from those populations to all have exactly the same mean (random sampling variation leads to slight differences between groups even when there are no differences on average in the population). As such, a significant p-value from an ANOVA tells us that the mean of at least one of the samples is more extreme (further away from the others) than we would expect to occur by chance through random sample variation. However, it does not tell us which group or groups were different to which other group or groups. To identify the specific groups that differ, we should start by examining the data visually using dot plots. A significant p-value could indicate that:

a) One sample is more extreme than one other sample (e.g., the least and most extreme scoring groups):

Albrecht et al (2007) dot plot

Ref: Albrecht, P., Fröhlich, R., Hartung, H. P., Kieseier, B. C., & Methner, A. (2007). Optical coherence tomography measures axonal loss in multiple sclerosis independently of optic neuritis. Journal of neurology, 254(11), 1595-1596.

b) One sample is more extreme than all other samples:

Ref: Ferraro, A., et al. (2011). Expansion of Th17 cells and functional defects in T regulatory cells are key features of the pancreatic lymph nodes in patients with type 1 diabetes. Diabetes, 60(11), 2903-2913.

Ferraro et al 2011 dot plot 2

c) All samples are extreme (i.e., they all differ notably from each other):

Ferraro et al 2011 dot plot 3

Ref: (as above) Ferraro, A., et al. (2011)

Or...

Ferraro et al 2011 dot plot 4

Ref: (as above) Ferraro, A., et al. (2011)

Dot plots can give a good idea of what is happening in the data, but it's not possible to determine which groups differ significantly using the plot alone. Therefore, group differences should also be examined statistically, using follow-up or 'post-hoc' tests. 

Significant ANOVA models must be followed up in order to understand the specific pattern of differences underlying the result, but non-significant ANOVAs do not need to be followed up. This is because a non-significant model indicates that all between-group differences are non-significant after adjusting for multiple testing.

Post hoc tests can be carried out via the main ANOVA analysis in statistical software by selecting 'post hocs' (sometimes called 'multiple comparisons' or 'multiple contrasts') and choosing an adjustment method (e.g., Bonferroni). Alternatively, post hocs can be carried out by running separate two sample t-tests and manually applying a correction for multiple testing, as described earlier in the notes.
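A sketch of the manual approach just described, using hypothetical data: all pairwise two-sample t-tests are run, and each is judged against a Bonferroni-adjusted cut-off:

```python
# Sketch: manual post-hoc testing after a significant ANOVA, running all
# pairwise two-sample t-tests and applying a Bonferroni correction.
from itertools import combinations
from scipy import stats

groups = {
    "treated":   [165, 170, 168, 172, 169, 171],
    "untreated": [160, 158, 162, 161, 159, 163],
    "placebo":   [161, 159, 160, 163, 162, 158],
}

pairs = list(combinations(groups, 2))   # 3 pairwise comparisons
threshold = 0.05 / len(pairs)           # Bonferroni-adjusted cut-off

results = {}
for a, b in pairs:
    t, p = stats.ttest_ind(groups[a], groups[b])
    results[(a, b)] = (p, p < threshold)   # raw p and adjusted verdict
    print(f"{a} vs {b}: p = {p:.4f}, significant = {p < threshold}")
```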

Data Assumptions of ANOVA

ANOVA is a parametric statistical method and as such, its valid use is dependent on a number of data assumptions being met. These assumptions mirror those of the two sample t-test, and dot plots of the data can be useful in assessing whether or not these assumptions have been met:

  • The data in each group should be approximately normally distributed 
  • The sample in each group should be large enough (>20 as a rough guide) 
  • The standard deviation of the groups should be approximately equal (no standard deviation should be more than twice the size of any other)
Nonparametric ANOVA

If the assumptions of a parametric ANOVA are not met then it may be possible to transform the data to achieve normality and/or approximate equality of standard deviations. However, the same transformation must be used for all groups, and so transformation will only be a suitable option if all groups have similar distributions.

If the assumptions are not met and a suitable transformation cannot be found, then a nonparametric alternative to the one-way ANOVA can be used. The nonparametric equivalent of the one-way ANOVA is called the Kruskal-Wallis ANOVA. 

The Kruskal-Wallis ANOVA tests the null hypothesis that the samples (groups) are from identical populations. The Kruskal-Wallis ANOVA does not make any data assumptions, meaning that it can be used with small groups, when data in any or all of the samples are skewed, and when the variability (standard deviation) within groups differs to a great extent between groups.
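For illustration, the Kruskal-Wallis test is available in Python as scipy.stats.kruskal; the values below are invented to show small, skewed samples with unequal spread:

```python
# Sketch: the nonparametric Kruskal-Wallis ANOVA on hypothetical data.
# It tests whether the samples come from identical populations, without
# assuming normality or equal standard deviations.
from scipy import stats

group_a = [1.2, 1.5, 1.1, 9.8, 1.3]   # small, skewed sample
group_b = [4.1, 4.5, 3.9, 4.2, 4.8]
group_c = [7.5, 7.9, 8.2, 7.7, 8.0]

h_stat, p_value = stats.kruskal(group_a, group_b, group_c)
print(f"H = {h_stat:.2f}, p = {p_value:.4f}")
```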

Examples

1) Ref: Gordon N et al, Rapid detection of hereditary and acquired platelet storage pool deficiency by flow cytometry, British Journal of Haematology, 1995; 89, 117-123.

A one-way ANOVA could be used to compare mean flow cytometry measurements between the 4 groups of patients shown in the plot; however, the samples in each group are small, and the difference in the variances/standard deviations between groups may be too great for the parametric test. As such, a Kruskal-Wallis ANOVA may be more valid.

Flow cytometry dot plot

Fig 3. Mepacrine staining by flow cytometry (per cent positive platelets) in normal controls, myeloproliferative disorders/myelodysplastic syndromes (MPD/MDS), delta-storage pool disease (SPD) and other platelet disorders (Others). The horizontal bar denotes the mean for each group.

2) Ref: Cetinel B et al, Update evaluation of benign prostatic hyperplasia: when should we offer prostatectomy? British Journal of Urology, 1994; 74, 566-571.

The data in the 3 groups are upwardly skewed. A log-transformation could be used to try to normalise the data in each group. If successful, this would make the data suitable for a parametric one-way ANOVA.

CETINEL 1994 benign prostatic hyperplasia

3) Ref: Neilly IJ et al, Plasma nitrate concentrations in neutropenic and non-neutropenic patients with suspected septicaemia, British Journal of Haematology, 1995; 89, 199-202.

Plasma nitrate measurements in the non-neutropenic hypotensive group are more spread and skewed than the other groups. Transformation is unlikely to make all 5 groups approximately normally distributed with equal variance. A Kruskal-Wallis analysis of variance could be used to compare the medians of all 5 groups simultaneously:

Neilly 1995 plasma nitrate
Categorical outcomes

When both the independent (predictor, factor or explanatory) and the dependent (outcome) variables are categorical, then ANOVA is not appropriate. The data can be displayed as a table and the previously described Chi Squared Test can be used to compare 3 or more groups/categories. 

In this case, only one p-value will be given and multiple separate tests would be needed to examine where significant differences between groups/categories lie. These multiple tests will need to be adjusted in a similar way to the 'post hoc tests' used with ANOVA.
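As a sketch (with made-up counts, not the data from any study cited here), a chi-squared test comparing proportions across three groups at once can be run with scipy.stats.chi2_contingency:

```python
# Sketch: a chi-squared test on a 2 x 3 table of hypothetical counts,
# comparing the proportion with an outcome across three groups at once.
import numpy as np
from scipy.stats import chi2_contingency

#                 group A  group B  group C
table = np.array([[30,      45,      70],    # with outcome
                  [70,      55,      30]])   # without outcome

chi2, p_value, dof, expected = chi2_contingency(table)
print(f"chi2 = {chi2:.2f}, df = {dof}, p = {p_value:.4f}")
```

A single p-value is returned; locating where the differences lie would need follow-up pairwise tests with a correction, as described above.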

Example

Ref: Chrcanovic BR et al, Facial fractures in children and adolescents: a retrospective study of 3 years in a hospital in Belo Horizonte, Brazil, Dental Traumatology, 2010; 26: 262-270. 

Data was collected on patients that had suffered trauma between 1 January 2000 and 31 December 2002 who had been attended to at Maria Amélia Lins Hospital in Belo Horizonte, Brazil. The following table shows the distribution of patients with and without fractures according to aetiology:

Table facial fractures

A chi-squared test to compare the proportion of patients with and without fractures between the various aetiologies found a highly significant difference (p<0.001). To formally investigate where this difference arose from, multiple tests were carried out using Bonferroni's correction (a post-hoc adjustment). As there were five categories, 10 significance tests would have to be carried out and results of p<0.05/10 = 0.005 were considered 'significant'. The table below shows the results of these multiple tests; results with a * are significant after this Bonferroni correction:

Table facial fractures 2

These multiple tests show that the percentage of patients suffering a fracture after a fall (54.3%) was significantly lower than all other causes except for 'other' aetiologies.

When assessing the association between a binary variable and an ordinal variable, a Chi Squared Test for Trend should be used to take account of the ordering.

Both the Chi Squared test and the Chi Squared test for trend assume a sample size of at least 20 within each group and no extreme proportions (< 0.1 or > 0.9) or percentages (< 10% or > 90%). Neither of these methods for analysing categorical outcomes can adjust for confounding factors; for this, regression analysis would be required.

Regression

Regression analysis is a way of describing and quantifying the variability in an outcome. Regression is a very important statistical tool as it allows us to adjust for confounding factors; for example, when comparing average hospital stay between patients with a complicated clinical course (CCC) and smooth clinical course (SCC), we may wish to adjust for patient age and gender.

There are many different types of regression. The particular form that must be used depends on the nature of the outcome variable (also known as the response or dependent variable) and the number of predictors (or independent variables). For all types of regression the predictors may take any form (numeric, categorical or a mixture of the two).

Linear regression is the most common type of regression and is appropriate when the outcome is continuous numeric. For example, to compare duration of hospital stay between clinical courses, to investigate the association between Flow Mediated Dilation (FMD) and pubertal mean TSH, or estimate the difference in birthweights of CMV infected babies and non-infected babies after adjusting for gestational age.

Logistic regression is appropriate when the outcome is binary. For example, to compare the proportion of babies born with cleft palate only between syndromic and non-syndromic births after adjusting for birthweight and family history, or investigate the association between infantile colic and cystic fibrosis whilst adjusting for duration of breast-feeding. 

Regression: How does regression work?

Example:

Ref: Race, Skills, and Earnings: American Immigrants in 1909, The Journal of Economic History, 1971, 31, 421-428.

In 1909, the US Immigration Commission collected data regarding the wages of different nationalities of immigrants; they aimed to investigate, amongst other things, the hypothesis that 'new' immigrants from southern and eastern Europe were earning less than the 'old' immigrants from north-western Europe. This data is given in the table below:

Regression aims to fit a model to the data in the form of an equation; this equation relates the outcome/dependent variable (on the left-hand side) to the predictor/independent variable(s) (on the right-hand side). For each independent variable in the equation, a coefficient is obtained that quantifies the extent to which changes in the outcome are associated with changes in that predictor/independent variable. For example, when investigating the association between length of hospital stay and age, the coefficient estimated in the fitted equation will give the amount by which the average hospital stay increases (or decreases) for each year older a patient is. The p-value for this model (with a single continuous predictor variable) will be the same as that from a test of Pearson's correlation coefficient. If the single predictor variable were binary, the p-value would match that given by a two-sample t-test, and the coefficient would be the observed mean difference between the two samples.

Regression example

Before any formal analysis is carried out, we may want to display the data graphically to investigate potential outliers or relationships; as we want to display two continuous variables, a scatterplot would be the most appropriate method:

Scatter diagram of average weekly wage and the percentage that have lived in the US for over 5 years

Examination of the scatterplot suggests that average weekly wages tend to increase as the percentage of people who have lived in the US for at least 5 years increases. To quantify this relationship, we could fit a linear regression model setting average weekly wage as the outcome/dependent variable and percentage living in the US over 5 years as the predictor/independent variable.

Regression example scatter


The regression equation was found to be:

Predicted Average weekly wage = 8.304 + 0.061 * Perc living in the US > 5 years

This can be visualised by adding it to the scatterplot:

Regression example scatter with line

Here, the coefficient value (0.061) quantifies the average change in weekly wage for every unit increase in the percentage living in the US for over 5 years; i.e., for every 10 point increase in the percentage of the group living in the US for over 5 years, the average weekly wage of that group is expected to increase by 61 cents (10 x $0.061). 
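The fitted equation can be used directly for prediction; the short sketch below simply evaluates the model reported above:

```python
# Sketch: evaluating the fitted regression equation from the text to
# predict average weekly wage from the percentage living in the US for
# over 5 years.

def predicted_wage(pct_over_5_years):
    return 8.304 + 0.061 * pct_over_5_years

# A 10 point increase in the percentage raises the prediction by 61 cents.
print(round(predicted_wage(50) - predicted_wage(40), 2))  # 0.61
```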

Most software packages give the standard error of the average change (or gradient of the line). We can calculate a confidence interval for the slope and test the null hypothesis that there is no association between the outcome/dependent and predictor/independent variables (gradient of line = 0).

E.g. the standard error of the 0.061 estimate was 0.009. A 95% confidence interval for the increase in average weekly wage per unit increase in percentage living in the US is given by:

(0.061 ± (1.96 x 0.009)) = (0.043, 0.079)

Testing against the null hypothesis of no change, we find how many standard errors away from the null hypothesis (0) our sample estimate lies and use the normal table to assign a p-value to this difference:

0.061 / 0.009 = 6.78 standard errors from 0; from the normal table, p < 0.001

Hence there is evidence that average weekly wage (dependent) varies according to the percentage living in the US for more than 5 years (independent); i.e. the 'old' immigrant nationalities earned more on average than the 'new' immigrant nationalities. We are 95% confident that the increase in average weekly wage for a 1 point increase in the percentage of people living in the US for at least 5 years is between 4.3 and 7.9 cents.
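The confidence interval and p-value above can be reproduced from the estimate and its standard error; a minimal Python sketch using the numbers from this example:

```python
# Sketch: a 95% confidence interval and a z-style test for a regression
# slope, given the estimate (0.061) and its standard error (0.009).
from scipy.stats import norm

estimate, se = 0.061, 0.009

lower = estimate - 1.96 * se
upper = estimate + 1.96 * se
print(f"95% CI: ({lower:.3f}, {upper:.3f})")   # (0.043, 0.079)

z = estimate / se                   # standard errors away from 0
p_value = 2 * (1 - norm.cdf(z))     # two-sided p from the normal table
print(f"z = {z:.2f}, p = {p_value:.4f}")
```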

Regression: Linear Regression with multiple predictors

When multiple predictor variables are included in a model, the coefficient estimates reflect the association between the outcome and a predictor after adjusting for all other variables in the model. For example, if investigating the relationship between length of stay in hospital and patient age, we may want to adjust for the potential confounder clinical course (simple or complex); to do this, clinical course is also entered into the model, and the coefficient for age now gives the amount by which the average hospital stay increases (or decreases) for each year older a patient is, after adjusting for clinical course.

Example:

The US Immigration Commission were concerned that the results from the previous linear model may be confounded by the levels of literacy in each group of immigrants; to overcome this potential confounder, the percentage of immigrants that were literate was also entered into the regression model. The dependent variable is still the average weekly wage but we now include two explanatory variables:

1) Percentage living in the US for over 5 years.

2) Percentage literate.

The updated regression equation was found to be:  

Predicted Average weekly wage = 1.903 + 0.029*Perc living in US > 5 yrs + 0.096*Perc literate

After adjusting for the percentage of each group that were literate, the estimate for the relationship between wages and time spent living in the US has changed: the expected increase in average weekly wage for every 1 point increase in the percentage of the group that has lived in the US for over 5 years is now 2.9 cents (as opposed to 6.1 cents). 

The confidence interval and p-value for the coefficient associated with the percentage living in the US over 5 years have also changed; the p-value after adjusting for the new predictor is now 0.004, meaning the association remains significant, i.e. incompatible with no relationship.

Regression: Dummy variables

Categorical explanatory variables can be introduced into regression models using dummy variables, which take the value 0 or 1 depending on which category is being modelled. One less dummy variable than the number of categories is needed to incorporate a categorical variable into the model. 

Dummy variables can also be used to carry out subgroup analysis; subgroups enter the model as dummy variables, the coefficients give an estimate of the differences between subgroups. We need one less dummy variable than the number of subgroups.

Example:

The Immigration Commission wanted to investigate if the area immigrants originated from was associated with weekly wage. Countries were categorised by region into north-western Europe (English, Swedish, Danish, etc.), other Europe (Greek, Lithuanians, etc.) and others. To add this nominal variable into the model, two variables were created to show which category each country fell into. Note that one less variable (2) than the number of categories (3) is required:

nw_eur = 1 if the group originated from north-western Europe, 0 otherwise
other_eur = 1 if the group originated from another part of Europe, 0 otherwise

This covers all possibilities:

  • A group from north-western Europe will have nw_eur = 1, other_eur = 0
  • A group from another part of Europe will have nw_eur = 0, other_eur = 1
  • A group from outside of Europe will have nw_eur = 0, other_eur = 0

The fitted regression model is:

Predicted Average weekly wage = 7.645 + 0.052*Perc living in US + 2.086*nw_eur + 1.012*other_eur

Groups from outside Europe are therefore predicted to have an average weekly wage of 7.645 + 0.052*Perc living in US. Groups of immigrants from north-western Europe earned $2.09 more on average per week than groups from outside of Europe, and $1.07 ($2.086 - $1.012) more on average per week than immigrants from other areas of Europe, after adjusting for other predictors in the model. The p-values for these coefficients were 0.006 and 0.138 respectively. Although the other_eur coefficient is not significant at a 5% level, it would be wrong to remove it without also removing the nw_eur variable, as they are part of the same predictor/independent variable (area of origin).
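A sketch of how the dummy coding and the fitted equation work together (the coefficients are taken from the model reported in the text; the helper functions are purely illustrative):

```python
# Sketch: hand-coding the two dummy variables for the three-category
# region variable, and evaluating the fitted model from the text.

def dummies(region):
    """Map a region label to its (nw_eur, other_eur) dummy values."""
    return {
        "north-western Europe": (1, 0),
        "other Europe":         (0, 1),
        "outside Europe":       (0, 0),
    }[region]

def predicted_wage(pct_in_us, region):
    nw_eur, other_eur = dummies(region)
    return 7.645 + 0.052 * pct_in_us + 2.086 * nw_eur + 1.012 * other_eur

# Holding the percentage constant, the regional difference is just the
# gap between the dummy coefficients.
gap = predicted_wage(50, "north-western Europe") - predicted_wage(50, "other Europe")
print(round(gap, 3))  # 1.074
```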

The relationship between average weekly wage and percentage of immigrants living in the US for over 5 years, after adjusting for origin, is shown in the plot below. The fitted model consists of 3 parallel lines, with the lines for the north-western European and other European groups lying 2.086 and 1.012 respectively above the line for the groups from outside Europe.

Regression example multiple scatter

The aim of regression analysis is to find the best possible model to answer a research question; the best model may not be the one with the smallest p-values or the most coefficients, and a certain level of common sense must be used. A common approach to model fitting that is often seen in journals is to:

1) Test each potential predictor separately - univariable associations

Then

2) Enter into a multiple regression model only those predictors that are significant univariably.

This is WRONG! DO NOT USE THIS APPROACH!!!

Regression models with multiple predictors give the association between a predictor and outcome after adjusting for other predictors. Failure to adjust for these multiple underlying relationships can mean that true associations are missed.

For example, consider the following situation where the outcome is associated with age but apparently not with sex when sex is examined alone; the association with sex, however, is strong once age is adjusted for.

a) The model fitted to show the association between gender and outcome1 was:

Outcome1 = 13.001 + 0.917*boy

The average difference between boys and girls is 0.917 and is not significant (p=0.19).

SPSS dot plot

b) After adjustment for age, the new model is:

Outcome1 = 1.783 + 4.197*boy + 1.997*age

The difference between boys and girls has changed to 4.197; this difference is now highly significant after adjusting for differences in age (p < 0.0005).

SPSS scatter plot
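The phenomenon can be demonstrated with simulated data (hypothetical, not the data plotted above): the outcome depends on both age and sex, but boys are sampled younger than girls, so the sex effect is hidden until age enters the model:

```python
# Sketch: simulated confounding. The outcome truly depends on age and sex,
# but because boys are younger on average, the unadjusted sex coefficient
# is close to zero; adjusting for age recovers the real difference.
import numpy as np

rng = np.random.default_rng(0)
n = 200
boy = np.repeat([1, 0], n // 2)
# Confounding: boys are sampled younger than girls on average.
age = np.where(boy == 1, rng.normal(8, 1, n), rng.normal(10, 1, n))
outcome = 2.0 * age + 4.0 * boy + rng.normal(0, 1, n)

# Unadjusted: regress outcome on sex alone.
X1 = np.column_stack([np.ones(n), boy])
b1 = np.linalg.lstsq(X1, outcome, rcond=None)[0]

# Adjusted: include age in the model.
X2 = np.column_stack([np.ones(n), boy, age])
b2 = np.linalg.lstsq(X2, outcome, rcond=None)[0]

print(f"unadjusted sex coefficient: {b1[1]:.2f}")  # near 0
print(f"adjusted sex coefficient:   {b2[1]:.2f}")  # near the true value, 4
```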

A better method is to use tools such as information criteria, which give a measure of how well a model fits the data; these measures can be used to compare similar models with the aim of finding the most parsimonious model (the simplest model that explains the data sufficiently). 

The AIC (Akaike Information Criterion) and BIC (Bayesian Information Criterion) give measures of how well a model fits the data whilst assigning a penalty based on how many predictor/independent variables are included. The smaller their value, the more parsimonious the model. 

The AIC and BIC are useful tools for model fitting; however, they can only be used to compare similar models for the same data and do not give a measure of how well a model fits in general. If all the models being tested fit poorly, these scores will give no indication of that; other diagnostic tools should be used to test the validity of models. 

Regression: Model Diagnostics

Model assumptions

Regression analysis is a parametric statistical method with certain assumptions that must be true of the data for results to be valid. The assumptions that must be checked depend on the regression model being used. However, all regression types fail or produce unstable results when multicollinearity exists.

Multicollinearity occurs when a predictor/independent variable can be predicted to a large degree of accuracy by one or more of the other independent variables. This can cause problems and produce misleading results. To identify collinearity in a model, the variance inflation factor (VIF) can be used. A higher VIF value indicates a larger degree of multicollinearity. A VIF of greater than 10 is considered very high as it means that more than 90% of the variation in that predictor can be explained by the other variables in the model; a VIF of greater than 5 usually warrants further investigation. 
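A sketch of how a VIF can be computed by hand: each predictor is regressed on the others, and VIF = 1/(1 − R²). The data below are simulated, with the third predictor deliberately made almost a copy of the first:

```python
# Sketch: computing the variance inflation factor for each predictor by
# regressing it on the remaining predictors; VIF = 1 / (1 - R^2).
import numpy as np

def vif(X):
    """X: (n, p) array of predictor columns. Returns one VIF per column."""
    n, p = X.shape
    out = []
    for j in range(p):
        y = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta = np.linalg.lstsq(others, y, rcond=None)[0]
        resid = y - others @ beta
        r2 = 1 - resid.var() / y.var()
        out.append(1 / (1 - r2))
    return out

rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = rng.normal(size=100)                   # independent of x1: VIF near 1
x3 = x1 + rng.normal(scale=0.1, size=100)   # nearly a copy of x1: VIF high

vifs = vif(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])
```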

Residuals

Residuals, or error terms, are the differences between the observed outcome values and the predicted outcomes given by a fitted model. A residual value can be calculated for each individual in the sample used to fit the model. In a linear regression model, the residuals can be interpreted directly and should be normally distributed around zero for all predictor variables. For other regression types, the residuals may need to be transformed or standardised before being interpreted due to the nature of the data.

In the previous example where the association between average weekly wages and the percentage of immigrants that had lived in the US for over 5 years was investigated, each group has a residual calculated by subtracting their predicted weekly wage (from the model below) from the observed weekly wage:

Predicted Average weekly wage_i = 8.304 + 0.061 * (percentage living in US > 5 years)_i

Residuals

In linear regression, if the response variable can on average sufficiently be described as a linear combination of the predictor variable(s), the residuals plotted against the observed response and each of the predictor variables will be evenly spread around 0 with no obvious pattern. The plot below shows the residuals of the linear model plotted against predicted weekly wages:

Residuals graph

This plot shows no obvious patterns with most residuals lying very close to the zero line indicating a good fit. 
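Residuals are straightforward to compute by hand; the observed wages below are hypothetical, invented purely to illustrate the calculation against the fitted model from the text:

```python
# Sketch: computing residuals (observed minus predicted) for the fitted
# linear model; with a good fit they scatter evenly around zero.
import numpy as np

def predicted_wage(pct):
    return 8.304 + 0.061 * pct   # fitted model from the text

# Hypothetical observed values for illustration only.
pct_over_5 = np.array([20, 40, 60, 80])
observed   = np.array([9.6, 10.8, 11.9, 13.2])

residuals = observed - predicted_wage(pct_over_5)
print(np.round(residuals, 3))
```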

Influential observations

Outlying or overly influential observations can be identified by re-fitting the model(s) after deletion of each observation in turn. If the model changes substantially, then that particular observation/individual/country had a high influence on the fitted model and should perhaps be investigated to identify what differences they had to the main body of data. Statistics such as Cook's Distance can be calculated to quantify the level of influence each individual has on the model.
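A sketch of the refit-after-deletion idea on hypothetical data, where the last point is a deliberate outlier:

```python
# Sketch: a leave-one-out check for influential observations, refitting a
# simple straight-line model with each point deleted in turn and watching
# how the slope changes.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 10.0])
y = np.array([1.1, 1.9, 3.2, 3.9, 5.1, 2.0])   # the last point is an outlier

full_slope = np.polyfit(x, y, 1)[0]

# Slope of the refitted line after deleting each observation in turn.
loo_slopes = [np.polyfit(np.delete(x, i), np.delete(y, i), 1)[0]
              for i in range(len(x))]

for i, s in enumerate(loo_slopes):
    print(f"dropping point {i}: slope = {s:.2f} (full fit: {full_slope:.2f})")
```

Deleting the outlier changes the slope dramatically, flagging it as highly influential; Cook's Distance formalises this same idea.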

Regression: Logistic regression

Logistic regression is used when the outcome/dependent variable of interest is binary. The results given in the modelling process are the same as those seen above in linear regression (coefficient estimates, confidence intervals, p-values), but the outcome must be transformed before it can be put into the modelling framework. This means that interpretation of results is more complex and requires transformation of estimates. Logistic regression has different assumptions to linear regression that must also be checked. When a model is fitted, it is still important to use diagnostics to check how well the model fits and to identify any potentially influential observations.