UCL Great Ormond Street Institute of Child Health

# Chapter 12: Revision/Evaluation

This chapter aims to evaluate your learning and understanding of the concepts and calculations we have covered in the course and materials for 'Introduction to Research Methods and Statistics'.

The chapter contains various short questions divided according to their format.

1) True or False Questions: A short introductory statement/question is followed by a series of short answers/responses; these statements should be rated as true or false. The questions cover all aspects of the course and are in no particular order of difficulty.

2) Best-of-Five Questions: In these questions a series of 5 statements follows an introductory statement. One or more of the statements may be correct. You should select the ONE best answer i.e., if there are 2 correct responses, then the best of these should be chosen.

3) Extended Matching Questions: In these questions a series of up to 10 'items' are given, followed by a 'lead-in' and then a series of 'stems'. For each of the stems, the student must choose the correct item from the list given the information in the lead-in. Each item may be used more than once.

Solutions for all questions are given after the relevant section.

## True or False Questions

Each statement is either true or false. None, any combination or all of the statements for a given question number may be true. Alongside each statement please state whether you think it is 'True' or 'False'.

(1) The mean of a large sample of size n:

(a) Is the same as the median if the sample is distributed symmetrically

(b) Is calculated by adding together all of the values and dividing by n

(c) Is always a reasonable measure of centre

(d) Is always greater than the standard deviation

(2) The following are measures of the spread of a distribution:

(a) Inter-quartile range

(b) Standard deviation

(c) Range

(d) Correlation coefficient

(e) Mode

(3) In a medical paper 150 patients were characterised as 'Age 26 years ± 5 years (mean ± standard deviation)'. If the ages were normally distributed, this means that:

(a) It is 95% certain that the true mean lies in the interval 16-36 years

(b) Most of the patients were aged 26 yrs; the remainder were aged 21-31 yrs

(c) Approximately 95% of the patients were aged between 16 and 36 years

(4) A pharmacokinetic investigation, including 216 volunteers, revealed that the plasma concentration one hour after oral administration of 10mg of the drug was 188ng/ml ± 10ng/ml (mean ± standard error). This implies that:

(a) 95% of the volunteers had plasma concentrations between 168 and 208ng/ml

(b) The interval 168 to 208 ng/ml is the normal range of the plasma concentration 1 hour after oral administration.

(c) We are 95% confident that the true mean lies somewhere within 168-208 ng/ml

(5) The standard error of the mean of a sample:

(a) Can be calculated using the standard deviation of the population and the size of the sample

(b) Measures how far the sample mean is likely to be from the population mean

(c) Is greater than the standard deviation of the observations

(6) A new drug was tested independently in 2 randomised controlled trials. The trials appeared comparable and comprised the same number of patients. One trial led to the conclusion that the drug was effective (p < 0.05), whereas the other trial led to the conclusion that the drug was ineffective (p > 0.05). The actual p-values were 0.041 and 0.097. This means that:

(a) The first trial gave a false-positive result

(b) The second trial gave a false-negative result

(c) Obviously, the trials were not comparable after all

(d) One can't attach too much importance to small differences between p-values

(7) A controlled trial of a new treatment led to the conclusion that it is significantly better than placebo; p<0.05. Therefore:

(a) It has been proved that the treatment is better than the placebo

(b) If the treatment is not effective, there is less than 5% chance of obtaining such results

(c) The observed effect is so large that there is less than a 5% chance that the treatment is no better than placebo

(8) Patients are asked for consent to partake in a study of a new treatment for hypertension. 95% agree to be randomised to one of three treatment groups. 90% of those that agree to partake continue on their allocated treatment for the specified time period. The main outcome measure is the change in each individual's blood pressure:

(a) The appropriate analysis is a one-way analysis of variance of the blood pressure changes.

(b) No information should be collected from those that refuse to participate.

(c) The analysis should include even those that do not complete their allotted treatment.

(d) Two-sample t-tests should be used to determine significant differences between the groups.

(e) If a one-way analysis of variance gives a significant p-value, then t-tests should be used to identify which group is responsible for that difference.

(9) In a small randomized double-blind trial of a new treatment in acute myocardial infarction, the mortality in the treated group was half that in the control group, but the difference was not significant. We can conclude that:

(a) The treatment is useless

(b) There is no point in continuing to develop the treatment

(c) Reduction in mortality is so great that we should introduce the treatment

(d) The confidence interval for the difference between the groups will be wide

(e) We should carry out a new trial of much greater size

(10) Non-parametric statistical tests:

(a) Are preferred to methods which assume the data to be normally distributed

(b) Are less powerful than methods based on the normal distribution when the data are normally distributed

(c) Often involve ranking the data

(d) Are preferred when data cannot be assumed to follow any particular distribution

(11) A study is designed to examine the relationship between blood pressure and occupation. Which 2 of the following must occur for age to be a confounding factor?

(a) The correlation between age and blood pressure is significant

(b) Blood pressure changes with age

(c) Different occupational groups have different age structures

(d) Age is linked to diet and diet affects blood pressure

(e) Within a given occupational group, age and blood pressure are linearly related

(12) The diastolic blood pressures (DBP) of a group of young men are normally distributed with a mean of 70mmHg and a standard deviation of 10mmHg. It follows that:

(a) About 95% of the men have a DBP between 60 and 80 mmHg.

(b) About 50% of the men have a DBP above 70mmHg

(c) The distribution of DBP is not skewed

(d) All the DBPs must be less than 100 mmHg

(e) About 2.5% of the men have DBP below 50 mmHg

(13) If the size of a random sample were increased, we would expect:

(a) The mean to decrease

(b) The standard error of the mean to decrease

(c) The standard deviation to decrease

(d) The sample variance to increase

(e) The accuracy of the parameter estimates to increase

(14) Randomisation of patients to treatments within a trial ensures that:

(a) The patient is unaware of the treatment group to which they are assigned.

(b) Each patient has an equal chance of being in any treatment group

(c) Treatment group is known before consent is obtained

(d) Although individuals receive different treatments, each patient will be allocated to the treatment most likely to benefit them

(e) Differences between treatments will be significant.

(15) Within a clinical trial, where possible allocation of patients to treatments should be:

(a) Blinded

(b) Randomised

(c) Systematic

(d) Decided prior to obtaining consent

(e) Performed away from the study centre

(16) In a trial of vitamin supplementation on reaction times amongst 12 year olds, which of the following are necessary for social class to be a confounder?

(a) Reaction times differ between the supplemented and non-supplemented (control) groups

(b) The children were not randomised to treatment groups

(c) The supplemented group have a different social class distribution to the non-supplemented (control) group

(d) The non-supplemented (control) group are all children of professional parents

(e) Reaction times are associated with social class

(17) A new asthma inhaler is tested within a double-blind crossover trial. With this study design:

(a) Neither the patient nor the assessor know which treatment (standard/new inhaler) is being given at any time

(b) Any differences found between the new and standard inhaler must be significant

(c) The order of treatments (new/standard inhalers) should be randomised

(d) Fewer patients will be needed than if a parallel trial of new versus standard inhalers had been used

(e) The outcome must be normally distributed.

(18) Observational studies:

(a) Cannot be randomised

(b) Give more convincing evidence of true differences than experimental studies

(c) Are always large

(d) Can never be useful

(e) Must be blinded

(19) Age and sex matched pairs of patients are allocated to new or standard treatments:

(a) Age and sex cannot confound the study results

(b) Randomisation to new or standard treatment should take place within-pair

(c) The treatment allocations must be blinded

(d) The pairing should be retained in the analysis

(e) Disease severity will be similar between the treatment groups

(20) Which of the following are categorical variables?

(a) Height

(b) Social class

(c) Age

(d) Gender

(e) Ethnicity

(21) When data is ranked:

(a) The highest rank is equal to the total number in the sample

(b) The median is the middle ranked value

(c) The lowest value has rank 1

(d) Any equal data values must be removed from the sample prior to ranking

(e) The mean of the values is the middle ranked value

(22) An oxygenation index is measured in a group of 30 children less than 10 years of age. The values obtained range from 2 to 250; the median value is 27 and the mean 60:

(a) The distribution of the oxygenation measurements is upwardly skew

(b) The best average value to use is the mean

(c) The mean is heavily influenced by relatively few children with high oxygenation values

(d) This oxygenation index is not a reliable measurement

(e) More measurements need to be made

(23) A group of 100 12 year old children with Fragile X syndrome undergo IQ testing. Their mean IQ is 97, standard deviation 5. The measurements are approximately normally distributed:

(a) The standard error of the measurements is 0.5 IQ points

(b) A 95% confidence interval for the population mean IQ of 12 year olds with Fragile X is (96, 98) based on this sample

(c) Approximately 95% of the children have IQ measurements in the range 87 to 107

(d) The mean IQ in this sample is significantly different from the expected mean of 100 amongst normal children

(e) The distribution of IQ is skew

(24) The standard error of an estimate:

(a) Is smaller for larger sample sizes

(b) Is a measure of the precision of that estimate

(c) Cannot be negative

(d) Depends on the average value of the sample

(e) Is used to construct confidence intervals

(25) A p-value:

(a) Indicates the statistical significance of any differences seen in the sample(s)

(b) Lies between -1 and +1

(c) Is more useful in interpreting results than a confidence interval

(d) Is the probability of obtaining the current sample if the null hypothesis is true

(e) Indicates the clinical significance of any differences seen in the sample(s)

(26) A parametric correlation coefficient:

(a) Must be positive

(b) Of zero indicates no relationship between the measurements

(c) Takes the value 1 only if the points lie on the line of equality

(d) Shows the extent to which two continuous measurements are linearly related

(e) Is negative if there is no association

(27) Reflex times are measured in a group of 5-15 year olds. The correlation between reflex time and age is calculated as 0.76 (95% confidence interval 0.7, 0.82):

(a) The correlation coefficient is significantly different to zero

(b) More measurements need to be made

(c) There is a linear association between reflex times and age

(d) The association is clinically important

(e) Older children tend to have slower reflex times

(28) Blood pressures are measured in 2 groups. Those receiving some treatment have blood pressures that are on average 6 mmHg lower than those in the untreated group. A t-test was applied and p=0.02, 95% confidence interval for the difference (4.16, 7.84) mmHg:

(a) The treatment should be introduced as it may be clinically relevant

(b) The treatment must have improved blood pressure by at least 4.16 mmHg on average in the population

(c) The difference observed would have occurred by chance one time in 20 if there really were no treatment effect

(d) Randomisation to groups was not successful

(e) The t-test would not have been appropriate if the blood pressure measurements were skew

(29) The numbers of children positive for a certain genetic defect were compared between asthmatics and healthy controls of a similar age:

(a) The study is invalid because it is not randomised

(b) A t-test could be used to assess the significance of the differences between groups

(c) Chi-square could be used to test the significance of the differences in proportions positive amongst the asthmatic and healthy groups

(d) Age may be a confounder

(e) A confidence interval for the difference in proportions positive would assist in interpretation of the results

(30) Within a randomised controlled trial, 40 of 100 individuals who received standard care had an adverse event in the following year, compared with 20 of 100 who received a programme of intensified care:

(a) Percentage reduction in adverse events attributable to intensified care = 20

(b) The number needed to treat (NNT) to avoid one adverse event = 5

(c) Relative risk (RR) = 0.5

(d) A confidence interval for the percentage reduction will depend on the sample size

(e) Age may be a confounder in the comparison

(31) The average (mean) BMI score of 30 children with Turner syndrome is 20 and the standard deviation is 3:

(a) If the BMI values are normally distributed, then most of the measurements will lie between 14 and 26

(b) None of the children are clinically obese

(c) Plotting BMI against age will be informative

(d) Normality of BMI scores should be considered when interpreting these results

(e) Thirty children is too few to give any useful information

## True or False Solutions

(1) The mean of a large sample of size n:

(a) Is the same as the median if the sample is distributed symmetrically (TRUE)

(b) Is calculated by adding together all of the values and dividing by n (TRUE)

(c) Is always a reasonable measure of centre (FALSE)

(d) Is always greater than the standard deviation (FALSE)

(2) The following are measures of the spread of a distribution:

(a) Inter-quartile range (TRUE)

(b) Standard deviation (TRUE)

(c) Range (TRUE)

(d) Correlation coefficient (FALSE)

(e) Mode (FALSE)

(3) In a medical paper 150 patients were characterised as 'Age 26 years ± 5 years (mean ± standard deviation)'. If the ages were normally distributed, this means that:

(a) It is 95% certain that the true mean lies in the interval 16-36 years (FALSE)

(b) Most of the patients were aged 26 yrs; the remainder were aged 21-31 yrs (FALSE)

(c) Approximately 95% of the patients were aged between 16 and 36 years (TRUE)

(4) A pharmacokinetic investigation, including 216 volunteers, revealed that the plasma concentration one hour after oral administration of 10mg of the drug was 188ng/ml ± 10ng/ml (mean ± standard error). This implies that:

(a) 95% of the volunteers had plasma concentrations between 168 and 208ng/ml (FALSE)

(b) The interval 168 to 208 ng/ml is the normal range of the plasma concentration 1 hour after oral administration. (FALSE)

(c) We are 95% confident that the true mean lies somewhere within 168-208 ng/ml (TRUE)
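
Questions 3 and 4 turn on the difference between mean ± 2 standard deviations (a range covering roughly 95% of individuals when the data are normally distributed) and mean ± 2 standard errors (an approximate 95% confidence interval for the population mean). The short Python sketch below works through both calculations. The function name `two_unit_interval` is purely illustrative, and the multiplier of 2 is the usual approximation used in this chapter.

```python
def two_unit_interval(mean, unit, multiplier=2.0):
    """Return (lower, upper) = mean -/+ multiplier * unit.

    `unit` is a standard deviation (range of individuals) or a
    standard error (confidence interval for the mean).
    """
    return (mean - multiplier * unit, mean + multiplier * unit)

# Question 3: 26 +/- 5 years is mean +/- SD, so the interval describes
# where about 95% of the individual patients lie.
age_range = two_unit_interval(26, 5)

# Question 4: 188 +/- 10 ng/ml is mean +/- SE, so the interval is an
# approximate 95% confidence interval for the population mean.
conc_ci = two_unit_interval(188, 10)

print(age_range)  # (16.0, 36.0)
print(conc_ci)    # (168.0, 208.0)
```

The arithmetic is identical in both cases; only the interpretation changes with the unit that follows the ± sign.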

(5) The standard error of the mean of a sample:

(a) Can be calculated using the standard deviation of the population and the size of the sample (TRUE)

(b) Measures how far the sample mean is likely to be from the population mean (TRUE)

(c) Is greater than the standard deviation of the observations (FALSE)

(6) A new drug was tested independently in 2 randomised controlled trials. The trials appeared comparable and comprised the same number of patients. One trial led to the conclusion that the drug was effective (p < 0.05), whereas the other trial led to the conclusion that the drug was ineffective (p > 0.05). The actual p-values were 0.041 and 0.097. This means that:

(a) The first trial gave a false-positive result (FALSE)

(b) The second trial gave a false-negative result (FALSE)

(c) Obviously, the trials were not comparable after all (FALSE)

(d) One can't attach too much importance to small differences between p-values (TRUE)

(7) A controlled trial of a new treatment led to the conclusion that it is significantly better than placebo; p<0.05. Therefore:

(a) It has been proved that the treatment is better than the placebo (FALSE)

(b) If the treatment is not effective, there is less than 5% chance of obtaining such results (TRUE)

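(c) The observed effect is so large that there is less than a 5% chance that the treatment is no better than placebo (FALSE)

The p-value is the probability of obtaining results at least this extreme if the treatment is no better than placebo (B is true). It is not the probability that the treatment is no better than placebo, given the results (C is false).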

(8) Patients are asked for consent to partake in a study of a new treatment for hypertension. 95% agree to be randomised to one of three treatment groups. 90% of those that agree to partake continue on their allocated treatment for the specified time period. The main outcome measure is the change in each individual's blood pressure:

(a) The appropriate analysis is a one-way analysis of variance of the blood pressure changes. (TRUE)

(b) No information should be collected from those that refuse to participate (FALSE)

(c) The analysis should include even those that do not complete their allotted treatment. (TRUE)

(d) Two-sample t-tests should be used to determine significant differences between the groups. (FALSE)

(e) If a one-way analysis of variance gives a significant p-value, then t-tests should be used to identify which group is responsible for that difference. (FALSE)

(9) In a small randomized double-blind trial of a new treatment in acute myocardial infarction, the mortality in the treated group was half that in the control group, but the difference was not significant. We can conclude that:

(a) The treatment is useless (FALSE)

(b) There is no point in continuing to develop the treatment (FALSE)

(c) Reduction in mortality is so great that we should introduce the treatment (FALSE)

(d) The confidence interval for the difference between the groups will be wide (TRUE)

(e) We should carry out a new trial of much greater size (TRUE)

(10) Non-parametric statistical tests:

(a) Are preferred to methods which assume the data to be normally distributed (FALSE)

(b) Are less powerful than methods based on the normal distribution when the data are normally distributed (TRUE)

(c) Often involve ranking the data (TRUE)

(d) Are preferred when data cannot be assumed to follow any particular distribution (TRUE)

(11) A study is designed to examine the relationship between blood pressure and occupation. Which 2 of the following must occur for age to be a confounding factor?

(a) The correlation between age and blood pressure is significant (FALSE)

(b) Blood pressure changes with age (TRUE)

(c) Different occupational groups have different age structures (TRUE)

(d) Age is linked to diet and diet affects blood pressure (FALSE)

(e) Within a given occupational group, age and blood pressure are linearly related (FALSE)

(12) The diastolic blood pressures (DBP) of a group of young men are normally distributed with a mean of 70mmHg and a standard deviation of 10mmHg. It follows that:

(a) About 95% of the men have a DBP between 60 and 80 mmHg. (FALSE)

(b) About 50% of the men have a DBP above 70mmHg (TRUE)

(c) The distribution of DBP is not skewed (TRUE)

(d) All the DBPs must be less than 100 mmHg (FALSE)

(e) About 2.5% of the men have DBP below 50 mmHg (TRUE)
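
The proportions in question 12 can be checked directly from the normal distribution. The sketch below uses `NormalDist` from Python's standard `statistics` module with the mean (70) and standard deviation (10) given in the question. The variable names are illustrative.

```python
from statistics import NormalDist

# Diastolic blood pressure as described in question 12
dbp = NormalDist(mu=70, sigma=10)

# (a) 60 to 80 is mean +/- 1 SD, which covers about 68%, not 95%
within_1sd = dbp.cdf(80) - dbp.cdf(60)

# (b) half of a symmetric distribution lies above its mean
above_mean = 1 - dbp.cdf(70)

# (e) 50 is 2 SD below the mean, so roughly 2.5% lie below it
below_50 = dbp.cdf(50)

# (d) 100 is 3 SD above the mean: rare, but not impossible
above_100 = 1 - dbp.cdf(100)

print(round(within_1sd, 3), above_mean, round(below_50, 3), round(above_100, 4))
```

The exact proportion below 50 is a little under 2.5% because mean ± 2 standard deviations is an approximation to mean ± 1.96 standard deviations.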

(13) If the size of a random sample were increased, we would expect:

(a) The mean to decrease (FALSE)

(b) The standard error of the mean to decrease (TRUE)

(c) The standard deviation to decrease (FALSE)

(d) The sample variance to increase (FALSE)

(e) The accuracy of the parameter estimates to increase (TRUE)
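
Statement (b) follows from the formula standard error = standard deviation / sqrt(n): the standard deviation describes the spread of individuals and is not expected to change systematically with sample size, whereas the standard error shrinks as n grows. A minimal sketch, assuming a hypothetical fixed standard deviation of 10 and illustrative sample sizes:

```python
import math

def standard_error(sd, n):
    """Standard error of the mean for a sample of size n."""
    return sd / math.sqrt(n)

sd = 10.0  # hypothetical standard deviation, held fixed for illustration
for n in (25, 100, 400):
    # Quadrupling the sample size halves the standard error
    print(n, standard_error(sd, n))
```

With these values the standard error falls from 2.0 to 1.0 to 0.5 as the sample size is quadrupled each time.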

(14) Randomisation of patients to treatments within a trial ensures that:

(a) The patient is unaware of the treatment group to which they are assigned. (FALSE)

(b) Each patient has an equal chance of being in any treatment group (TRUE)

(c) Treatment group is known before consent is obtained (FALSE)

(d) Although individuals receive different treatments, each patient will be allocated to the treatment most likely to benefit them (FALSE)

(e) Differences between treatments will be significant. (FALSE)

Individuals should be randomised to groups to remove any potential bias. Randomisation means that each patient has the same chance of being assigned to any of the groups, regardless of their personal characteristics. Random does not mean haphazard or systematic. Hence, B is true and D is false.

It is blinding not randomisation that ensures that individuals do not know which groups they are in (A is false).

Randomisation should take place after consent is obtained (C is false).

Randomisation aims to ensure that the treatment groups are similar apart from the treatment under study; hence any differences in outcome are more easily attributable to being causally related to treatment than if the treatments were allocated according to some other system.

Randomisation helps to avoid potential confounders. So, whilst randomisation does not ensure that differences in treatments will be significant it does make it more likely that any significant differences that are seen are due to treatment (E is false).
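
As an illustration of what "the same chance of being assigned to any group" means in practice, the sketch below performs simple randomisation with Python's standard `random` module. The function name, the patient identifiers and the group labels are all hypothetical, and a real trial would use a properly concealed allocation system rather than a script like this.

```python
import random

def randomise(patient_ids, groups, seed=None):
    """Simple randomisation: each patient is assigned to one of the
    groups with equal probability, ignoring patient characteristics."""
    rng = random.Random(seed)  # seed only makes the example repeatable
    return {pid: rng.choice(groups) for pid in patient_ids}

# Hypothetical patients 1..10, already consented, and three treatment arms
groups = ["A", "B", "C"]
allocation = randomise(range(1, 11), groups, seed=1)
for pid, group in allocation.items():
    print(pid, group)
```

Note that simple randomisation of this kind does not guarantee equal group sizes or balanced characteristics in any one trial, especially a small one.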

(15) Within a clinical trial, where possible allocation of patients to treatments should be:

(a) Blinded (TRUE)

(b) Randomised (TRUE)

(c) Systematic (FALSE)

(d) Decided prior to obtaining consent (FALSE)

(e) Performed away from the study centre (TRUE)

Allocation to treatments should be random rather than systematic to avoid potential bias (B is true, C is false).

Ideally randomisation should be made via telephone so that the process cannot be influenced by any known features of the patient (E is true).

Consent should be obtained before treatment group is determined otherwise the approach taken to gaining consent and/or the patients' decision as to whether or not to consent may be affected by the planned allocation (D is false).

If the patient and/or the clinician/assessor know which treatment a patient is having then this may influence the recorded outcome. We say a study is blind when the patient and/or the clinician/assessor do not know the treatment allocation. Single-blind is the term used when one of the two (patient, clinician/assessor) does not know the allocation but the other does; double-blind means that neither knows (A is true).

(16) In a trial of vitamin supplementation on reaction times amongst 12 year olds, which of the following are necessary for social class to be a confounder?

(a) Reaction times differ between the supplemented and non-supplemented (control) groups (FALSE)

(b) The children were not randomised to treatment groups (TRUE)

(c) The supplemented group have a different social class distribution to the non-supplemented (control) group (TRUE)

(d) The non-supplemented (control) group are all children of professional parents (FALSE)

(e) Reaction times are associated with social class (TRUE)

For social class to be a confounder, it must be associated with the outcome, reaction times (E is true) and different between comparison groups (C is true).

Certain types of randomisation, such as minimisation and stratification, may help overcome confounders (B may be true), but simple randomisation will not do this.

A is false as this is not related to social class. D is also false as it is irrelevant.

(17) A new asthma inhaler is tested within a double-blind crossover trial. With this study design:

(a) Neither the patient nor the assessor know which treatment (standard/new inhaler) is being given at any time (TRUE)

(b) Any differences found between the new and standard inhaler must be significant (FALSE)

(c) The order of treatments (new/standard inhalers) should be randomised (TRUE)

(d) Fewer patients will be needed than if a parallel trial of new versus standard inhalers had been used (TRUE)

(e) The outcome must be normally distributed. (FALSE)

In a crossover study, each patient receives treatment and placebo in a random order. Fewer patients are needed because many between-patient confounders may be removed; hence D is true.

The order of treatments should be randomised otherwise bias may be introduced if there is an order effect (C is true).

A study is double-blind if neither the patient, nor the researcher assessing the patients or the treating clinician, knows which treatment the patient has been randomized to receive. So, in the described study neither knows whether the standard or new inhaler is being used at any given time (A is true).

When the results are analysed the crossover design must be taken into account and the pairing of outcomes within patient (their time on new vs. their time on standard) retained. The particular form of the statistical analysis (parametric/non-parametric) will depend on the distribution of the outcome variable. If the outcomes are normally distributed then t-tests would be appropriate. E is false: the outcome may or may not be normally distributed, and choosing a particular study design (crossover) does not make the outcome normally distributed.

Any differences found in outcome during the 2 treatment periods may or may not be statistically significant and/or clinically significant. There may be differences that are non-significant (B is false).

(18) Observational studies:

(a) Cannot be randomised (TRUE)

(b) Give more convincing evidence of true differences than experimental studies (FALSE)

(c) Are always large (FALSE)

(d) Can never be useful (FALSE)

(e) Must be blinded (FALSE)

In observational studies the groups being compared are already defined and the study merely observes what happens. Since groups (different diseases or different treatments) are already determined and known they cannot be blind (E is false) or randomised (A is true).

The size of the sample that is studied is determined by the researcher and could be small (C is false).

If a difference is found in an observational study then it is more likely to be due to confounding than in an experimental trial since groups were determined before the study. A difference found with an experimental design is more likely to be due to the treatment than in an observational study (B is false).

Despite the potential for confounding and their non-blind, non-randomised nature, observational studies can provide useful information. Interpretation of results from observational studies should take into account the limitations of the design (D is false).

(19) Age and sex matched pairs of patients are allocated to new or standard treatments:

(a) Age and sex cannot confound the study results (TRUE)

(b) Randomisation to new or standard treatment should take place within-pair (TRUE)

(c) The treatment allocations must be blinded (FALSE)

(d) The pairing should be retained in the analysis (TRUE)

(e) Disease severity will be similar between the treatment groups. (FALSE)

Confounding may be avoided by matching individuals in the groups according to potential confounders. Here, individuals of the same age and sex are allocated to new or standard treatments within pairs. Therefore the new and standard groups will have the same age and sex distribution and these two variables cannot be confounders (A is true).

Randomisation to new or standard treatment should take place within pair (B is true) and this pairing should be retained in the analysis (D is true).

Ideally treatment allocation should be blinded but it may not be possible to do so for these particular treatments (C is false).

Individuals of the same age and sex are not necessarily similar in their disease severity, hence matching for age and sex does not ensure that disease severity will be similar between the groups (E is false).
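
Within-pair randomisation, as described for question 19, can be sketched as a coin toss for each matched pair that decides which member receives the new treatment. The function name and patient labels below are hypothetical, and the pairing is kept in the output so that it can be retained in the analysis.

```python
import random

def randomise_within_pairs(pairs, seed=None):
    """For each matched pair, randomly choose which member receives
    the new treatment; the other member receives the standard one."""
    rng = random.Random(seed)  # seed only makes the example repeatable
    allocation = []
    for first, second in pairs:
        if rng.random() < 0.5:
            allocation.append({"new": first, "standard": second})
        else:
            allocation.append({"new": second, "standard": first})
    return allocation

# Hypothetical age- and sex-matched pairs of patients
pairs = [("P1", "P2"), ("P3", "P4"), ("P5", "P6")]
result = randomise_within_pairs(pairs, seed=0)
for row in result:
    print(row)
```

Each row of the result is one matched pair, so the two treatment groups automatically have the same age and sex distribution.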

(20) Which of the following are categorical variables?

(a) Height (FALSE)

(b) Social class (TRUE)

(c) Age (FALSE)

(d) Gender (TRUE)

(e) Ethnicity (TRUE)

Data may be either categorical or numeric. With categorical variables each individual lies in one category, whereas numeric data is measured on a number scale. Height and age are both numeric (A and C are false).

Social class falls into one of 5 or 6 categories depending on how it is defined (B is true). Gender is either male or female i.e., one of two categories (D is true) and ethnicity may be divided into varying numbers of categories (E is true).

(21) When data is ranked:

(a) The highest rank is equal to the total number in the sample (TRUE)

(b) The median is the middle ranked value (TRUE)

(c) The lowest value has rank 1 (TRUE)

(d) Any equal data values must be removed from the sample prior to ranking (FALSE)

(e) The mean of the values is the middle ranked value (FALSE)

Ranks give the order of increasing magnitude of numeric variables. The lowest value has rank 1 (C is true).

The highest value will have a rank equal to the total number in the sample e.g., if there are 10 values, then the ranks 1, 2, 3, ..., 9, 10 will be assigned and the largest value will have rank 10 (A is true).

Half of the values will be smaller than the middle ranked value and half will be larger, hence the middle ranked value is the median/ 50th percentile (B is true).

If the data is skew then the mean and median will be different. Since the median is always the middle ranked value, the mean will not be equal to the middle ranked value if the data is skew (E is false).

Equal data values should be given equal ranks. To achieve this, the ranks they would otherwise occupy are averaged and that average is assigned to each of the tied values. Each value in the dataset should be given a rank; data values that are the same should not be removed from the sample prior to ranking (D is false).
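
The ranking rules above (lowest value has rank 1, tied values share the average of the ranks they occupy) can be written out as a short function. This is a minimal sketch in plain Python; the function name `average_ranks` and the example values are invented for illustration.

```python
def average_ranks(values):
    """Rank values from smallest (rank 1) to largest, giving tied
    values the average of the ranks they jointly occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        # Extend `end` over the run of values tied with order[start]
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        # Sorted positions start..end (0-based) hold ranks start+1..end+1
        shared = (start + 1 + end + 1) / 2
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks

# Invented example: the two 7s occupy ranks 1 and 2, so each gets 1.5
print(average_ranks([12, 7, 7, 30, 15]))  # [3.0, 1.5, 1.5, 5.0, 4.0]
```

No values are removed: all five observations receive a rank, and the largest value still has rank 5, the total number in the sample.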

(22) An oxygenation index is measured in a group of 30 children less than 10 years of age. The values obtained range from 2 to 250; the median value is 27 and the mean 60:

(a) The distribution of the oxygenation measurements is upwardly skew (TRUE)

(b) The best average value to use is the mean (FALSE)

(c) The mean is heavily influenced by relatively few children with high oxygenation values (TRUE)

(d) This oxygenation index is not a reliable measurement (FALSE)

(e) More measurements need to be made (FALSE)

For symmetrically distributed data, the mean and median values are the same and will be mid-way between extreme values of the distribution. When data is skew then the mean is pulled in the direction of the skew away from the median. Skew is named according to the direction of the outlying tail. For the oxygenation index the mean is larger than the median and both are much closer to the lowest value (2) than to the highest (250). Hence it is reasonable to assume that most individuals have relatively low oxygenation values and there are a few with high values that have a large influence on the mean, making it much larger than the median (A is true, C is true).

Since the mean is influenced by a few large values it will not be representative of the bulk of the values and the median is a much better measure of what is average or representative of most individuals (B is false).

No measures of the precision of the mean and median estimates are given. The precision will depend on the sample size. Hence the group of 30 children may or may not give adequate information; we would need to estimate precision and see whether this is suitable for our needs. If greater precision is required then a larger sample would have to be taken (E is false).

Reliability is the extent to which the measurements would be replicated if re-taken (for example at a different time or by a different assessor). The values given are based on a single measurement in each child and this gives us no information about reliability (D is false).

(23) A group of 100 12 year old children with Fragile X syndrome undergo IQ testing. Their mean IQ is 97, standard deviation 5. The measurements are approximately normally distributed:

(a) The standard error of the measurements is 0.5 IQ points (TRUE)

(b) A 95% confidence interval for the population mean IQ of 12 year olds with Fragile X is (96, 98) based on this sample (TRUE)

(c) Approximately 95% of the children have IQ measurements in the range 87 to 107 (TRUE)

(d) The mean IQ in this sample is significantly different from the expected mean of 100 amongst normal children (TRUE)

(e) The distribution of IQ is skew (FALSE)

Normally distributed data is symmetric and therefore not skew (E is false). Since the measurements are approximately normally distributed, then we would expect about 95% of them to lie within a range mean ± 2 standard deviations (section 2.4 p742). Hence, approximately 95% will lie in the range 97 ± 2(5) = 97 ± 10 = 87, 107 (C is true).

The standard error is calculated as the standard deviation divided by the square root of the sample size and the interval (mean ± 2 standard errors) is an approximate 95% confidence interval for the population mean. Hence for this sample, standard error = 5/sqrt(100) = 5/10 = 0.5 (A is true) and a 95% confidence interval is given by (97 ± 2(0.5)) = (97 ± 1) = (96, 98) (B is true).

The confidence interval gives the range of population values that the sample data are compatible with. In this case the interval (96, 98) excludes the expected mean of 100 found amongst normal children. Hence the mean IQ in the sample is significantly different from 100 (D is true).
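The arithmetic in this solution can be checked with a short Python sketch. The function names are illustrative only, and the multiplier of 2 is the approximation used in this chapter rather than the more exact 1.96.

```python
import math

def standard_error(sd, n):
    """Standard error of a mean: standard deviation / sqrt(sample size)."""
    return sd / math.sqrt(n)

def plus_minus_two(centre, unit):
    """Approximate 95% interval: centre +/- 2 units (unit = SD or SE)."""
    return (centre - 2 * unit, centre + 2 * unit)

# Values taken from question 23
mean_iq, sd_iq, n_children = 97, 5, 100

se = standard_error(sd_iq, n_children)           # precision of the sample mean
reference_range = plus_minus_two(mean_iq, sd_iq)  # where ~95% of children lie
confidence_interval = plus_minus_two(mean_iq, se) # where the population mean plausibly lies

# The expected mean of 100 is "significantly different" if it falls outside the interval
excludes_100 = not (confidence_interval[0] <= 100 <= confidence_interval[1])

print(se, reference_range, confidence_interval, excludes_100)
```

Note that the same "± 2" rule is applied twice: with the standard deviation it describes individual children, and with the standard error it describes the population mean.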

(24) The standard error of an estimate:

(a) Is smaller for larger sample sizes (TRUE)

(b) Is a measure of the precision of that estimate (TRUE)

(c) Cannot be negative (TRUE)

(d) Depends on the average value of the sample (FALSE)

(e) Is used to construct confidence intervals (TRUE)

The standard error is a measure of how precisely the sample value approximates the true population value (B is true) and for continuous data is calculated as the standard deviation divided by the square root of the sample size.

Confidence intervals can be constructed around the sample estimate using the standard error (E is true).

Precision will obviously be greater/better for larger sample sizes and we can also see this from the formula for calculating standard error. The standard deviation is divided by the square root of the sample size, hence as the sample size gets larger we are dividing by a bigger number and the standard error will be smaller indicating greater/better precision. Also, since the standard deviation is always positive, the standard error must also be positive (A is true, C is true).

Although it depends on the spread of the values around the average (i.e., the standard deviation) it does not depend on the average itself, being a measure of the precision of that average (D is false).
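As a minimal sketch of these points, the Python below computes the standard error for a fixed standard deviation at several sample sizes, and then for a small sample before and after adding a constant to every value. The sample values are invented purely for illustration.

```python
import math
import statistics

def standard_error_from_sd(sd, n):
    """Standard error of a mean from a known standard deviation."""
    return sd / math.sqrt(n)

def standard_error(values):
    """Standard error of the mean of a list of measurements."""
    return statistics.stdev(values) / math.sqrt(len(values))

# Same standard deviation (5), increasing sample size: the standard error shrinks
se_by_n = [standard_error_from_sd(5, n) for n in (25, 100, 400)]

# Hypothetical sample, and the same sample with 100 added to every value:
# the average changes but the spread, and so the standard error, does not
sample = [2, 4, 4, 4, 5, 5, 7, 9]
shifted = [value + 100 for value in sample]

print(se_by_n)
print(standard_error(sample), standard_error(shifted))
```

Quadrupling the sample size halves the standard error, which is why gains in precision become progressively more expensive.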

(25) A p-value:

(a) Indicates the statistical significance of any differences seen in the sample(s) (TRUE)

(b) Lies between -1 and +1 (FALSE)

(c) Is more useful in interpreting results than a confidence interval (FALSE)

(d) Is the probability of obtaining the current sample if the null hypothesis is true (TRUE)

(e) Indicates the clinical significance of any differences seen in the sample(s) (FALSE)

The p-value is the probability of obtaining the current sample if the null hypothesis were true (D is true) and gives a measure of the statistical significance of any differences seen (A is true).

As it is a probability it can range from 0 (no probability/never happens) to 1 (certainty/always happens) but cannot be negative (B is false).

The p-value gives an indication of how likely one particular hypothesised value is to be true, whereas the confidence interval gives the range of hypothesised values with which the sample is compatible. Hence confidence intervals give much greater information and facilitate clinical interpretation of the results (C is false).

The p-value gives statistical significance; clinical significance will depend additionally on other factors such as inconvenience associated with treatment, level of improvement or difference and costs (E is false).

(26) A parametric correlation coefficient:

(a) Must be positive (FALSE)

(b) Of zero indicates no relationship between the measurements (FALSE)

(c) Takes the value 1 only if the points lie on the line of equality (FALSE)

(d) Shows the extent to which two continuous measurements are linearly related (TRUE)

(e) Is negative if there is no association (FALSE)

The most commonly used correlation coefficient is Pearson's correlation and this is parametric. It gives a measure of the linear association between two continuous measurements. Non-parametric correlation coefficients (Spearman or Kendall) measure the tendency for one variable to fall or rise as the other increases, whether this tendency is linear or not (D is true).

All correlation coefficients can take values between -1 and +1 (A is false).

Negative values indicate that as one of the variables increases the other decreases (E is false).

A value of 1 indicates a perfect positive relationship and if the correlation coefficient is parametric then this would mean that the points lie on a straight line, however the line is not necessarily the line of equality (C is false).

A Pearson correlation coefficient of zero indicates that there is no linear association between the variables but there may still be a non-linear association (B is false).
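Two of these points (C and B) can be illustrated with a small Python sketch. The data are invented for illustration, and `pearson_r` is a hand-written helper using the usual Pearson formula rather than a library function.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

xs = [-2, -1, 0, 1, 2]

# Points on the straight line y = 2x + 3, which is NOT the line of equality y = x
on_a_line = [2 * x + 3 for x in xs]

# A perfect but non-linear (U-shaped) relationship, y = x squared
u_shaped = [x ** 2 for x in xs]

r_line = pearson_r(xs, on_a_line)   # perfect linear association
r_curve = pearson_r(xs, u_shaped)   # no linear association despite a clear relationship

print(r_line, r_curve)
```

The first correlation is 1 even though no point lies on the line of equality; the second is 0 even though y is completely determined by x.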

(27) Reflex times are measured in a group of 5-15 year olds. The correlation between reflex time and age is calculated as 0.76 (95% confidence interval 0.7, 0.82):

(a) The correlation coefficient is significantly different to zero (TRUE)

(b) More measurements need to be made (FALSE)

(c) There is a linear association between reflex times and age (TRUE)

(d) The association is clinically important (FALSE)

(e) Older children tend to have slower reflex times (TRUE)

The correlation coefficient gives a measure of linear association between reflex times and age. Zero indicates no linear association, the closer the value is to + or - 1, the stronger the linear association. For this sample, the value is 0.76 and this indicates some linear association (C is true).

The coefficient is positive and this shows that age and reflex time both increase together i.e., older children have slower/longer reflex times (E is true).

The confidence interval for the correlation (0.7, 0.82) does not contain zero and the correlation is therefore significantly different to zero (A is true).

Whether or not more measurements need to be made depends on whether the precision obtained for the estimate is suitable for whatever purpose it was made (B is false).

The association is statistically significant; whether it is clinically important depends additionally on other factors, for example the extent to which reflex time changes with age and/or whether there are practical implications of having a faster reflex time (D is false).

(28) Blood pressures are measured in 2 groups. Those receiving some treatment have blood pressures that are on average 6 mmHg lower than those in the untreated group. A t-test was applied and p=0.02, 95% confidence interval for the difference (4.16, 7.84) mmHg:

(a) The treatment should be introduced as it may be clinically relevant (FALSE)

(b) The treatment must have improved blood pressure by at least 4.16 mmHg on average in the population (FALSE)

(c) The difference observed would have occurred by chance one time in 20 if there really were no treatment effect (FALSE)

(d) Randomisation to groups was not successful (FALSE)

(e) The t-test would not have been appropriate if the blood pressure measurements were skew (TRUE)

The difference observed is statistically significant and would have occurred by chance 2 times in 100 (since p=0.02) or 1 time in 50 if there really were no treatment effect (C is false).

We cannot tell from the information given whether or not randomisation was successful (D is false).

The sample is compatible with average falls in blood pressure of between 4.16 and 7.84 mmHg. We are 95% confident that the population average fall lies within this interval, however it may not (in fact 5% of the time it won't) so B is false.

Statistical significance does not necessarily imply clinical significance or relevance. Clinical relevance is determined by factors other than statistical significance, for example whether the reduction observed is clinically meaningful and whether introducing the treatment to gain that sort of difference is worthwhile, which will in turn depend on the cost and inconvenience of the new treatment (A is false).

The t-test is not valid if the measurements are not normally distributed, skew data is not normally distributed (E is true).

(29) The numbers of children positive for a certain genetic defect were compared between asthmatics and healthy controls of a similar age:

(a) The study is invalid because it is not randomised (FALSE)

(b) A t-test could be used to assess the significance of the differences between groups (FALSE)

(c) Chi-square could be used to test the significance of the differences in proportions positive amongst the asthmatic and healthy groups (TRUE)

(d) Age may be a confounder (FALSE)

(e) A confidence interval for the difference in proportions positive would assist in interpretation of the results (TRUE)

This is an observational study and cannot be randomised; we cannot allocate asthma randomly (A is false).

For a variable to be a confounder it must differ between the groups being compared and also affect outcome. Since the asthmatic and healthy children are of similar ages, age cannot be a confounder in the comparison (D is false).

Presenting the difference in proportions positive in the two groups together with a confidence interval to show the precision with which this difference is estimated would be informative (E is true). T-tests are appropriate for continuous numeric outcomes (B is false).

Chi-square can be used to compare differences in proportions (C is true).
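The question gives no counts, so the sketch below uses hypothetical numbers (30 of 100 asthmatics and 15 of 100 controls positive) simply to show how a chi-square statistic for a 2 x 2 table is built from observed and expected counts. It assumes the simple Pearson form without a continuity correction, and compares the result with 3.841, the 5% critical value of chi-square on 1 degree of freedom.

```python
def chi_square_2x2(pos_a, neg_a, pos_b, neg_b):
    """Pearson chi-square statistic (no continuity correction) for a 2x2 table."""
    observed = [[pos_a, neg_a], [pos_b, neg_b]]
    row_totals = [sum(row) for row in observed]
    col_totals = [pos_a + pos_b, neg_a + neg_b]
    total = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            # Count expected in this cell if the proportion positive were the same in both groups
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed[i][j] - expected) ** 2 / expected
    return statistic

# Hypothetical counts, NOT taken from the study described in the question
statistic = chi_square_2x2(30, 70, 15, 85)

CRITICAL_5_PERCENT_1_DF = 3.841
significant = statistic > CRITICAL_5_PERCENT_1_DF

print(round(statistic, 2), significant)
```

With these invented counts the statistic is about 6.45, which exceeds 3.841, so the difference in proportions would be significant at the 5% level.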

(30) Within a randomised controlled trial, 40 of 100 individuals who received standard care had an adverse event in the following year, in contrast to 20 of 100 who received a program of intensified care:

(a) Percentage reduction in adverse events attributable to intensified care = 20 (TRUE)

(b) The number needed to treat (NNT) to avoid one adverse event = 5 (TRUE)

(c) Relative risk (RR) = 0.5 (TRUE)

(d) A confidence interval for the percentage reduction will depend on the sample size (TRUE)

(e) Age may be a confounder in the comparison (TRUE)

The percentage who suffered an adverse event fell from 40 to 20, hence the percentage reduction was 20 (A is true).

With a 20% fall this means that an extra 5 individuals would need to receive intensified care for 1 to avoid an adverse event. Hence the number needed to treat (NNT) = 5 (B is true).

A confidence interval around the percentage reduction (20) will take into account the sample size - the greater the sample size the more precise the estimate of the difference attributable to intensified care and the narrower the confidence interval (D is true).

The relative risk = risk in intensified care group divided by risk in standard care group = 20/40 = 0.5 (C is true).

If the groups (standard and intensified care) differ in their age distribution and age affects outcome (adverse event yes/no) then age would be a confounder in the comparison (E is true).
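The three summary measures in this solution can be reproduced with a few lines of Python. The variable names are illustrative; the counts are those given in the question.

```python
# Counts from question 30
events_standard, n_standard = 40, 100
events_intensified, n_intensified = 20, 100

risk_standard = events_standard / n_standard            # 0.4
risk_intensified = events_intensified / n_intensified   # 0.2

# Reduction in the percentage with an adverse event (in percentage points)
percentage_reduction = 100 * (risk_standard - risk_intensified)

# Number needed to treat = 1 / absolute risk reduction
nnt = 1 / (risk_standard - risk_intensified)

# Relative risk = risk with intensified care / risk with standard care
relative_risk = risk_intensified / risk_standard

print(percentage_reduction, nnt, relative_risk)
```

The reduction of 20 is a difference between two percentages (40% and 20%), whereas the relative risk of 0.5 expresses the same result as a ratio.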

(31) The average (mean) BMI score of 30 children with Turners syndrome is 20 and the standard deviation is 3:

(a) If the BMI values are normally distributed, then most of the measurements will lie between 14 and 26 (TRUE)

(b) None of the children are clinically obese (FALSE)

(c) Plotting BMI against age will be informative (FALSE)

(d) Normality of BMI scores should be considered when interpreting these results (TRUE)

(e) Thirty children is too few to give any useful information (FALSE)

If the measurements are normally distributed then approximately 95% will be in an interval mean ± 2 standard deviations. In this case, mean BMI = 20, SD = 3, so we would expect approximately 95% to lie within the interval 20 ± 2(3) = 20 ± 6 = 14, 26 (A is true).

Although most of the children would lie in this range, 5% would be outside. If the data are skew then the mean and standard deviation will not be very useful summaries of the data (D is true).

We cannot say whether any of the children in the sample are clinically obese (BMI > 30) from the information given (B is false).

Plotting BMI against age may or may not be informative (C is false).

Whether there are enough children to give useful information depends on how precisely we needed to estimate the average BMI in this group. We would need to consider the confidence interval around the sample mean to see whether this was sufficiently narrow for our purposes and, if not, we would need to take a larger sample. We cannot therefore say that thirty children is too few to give any useful information (E is false).

Best of Five Questions

In these questions a series of 5 statements follows a lead introduction. One or more of the statements may be correct. You should select the ONE best answer i.e., if there are 2 correct responses, then the best of these must be chosen.

(1) One hundred children with fragile X have their heights measured, and heights are expressed as SD-scores (i.e., corrected for age). The mean score is -0.8 and standard deviation 0.15. On the basis of this sample, it can be concluded that the population average SD score for fragile X children lies with 95% confidence within the interval:

(a) (-0.5, 1.1)

(b) -0.8

(c) (-0.496, 1.087)

(d) (-0.83, -0.77)

(e) (-0.845, -0.765)

(2) A positive correlation is found between weight and systolic blood pressure amongst a group of 10 year old children (r=0.7, 95% CI (0.52, 0.88), p=0.0004). It can be concluded that:

(a) Before weight loss can occur, systolic blood pressure must be lowered

(b) Those with high systolic blood pressures are more likely to be obese

(c) There is a significant tendency for heavier 10 year olds to have higher systolic blood pressures

(d) Weight affects systolic blood pressure

(e) There is a non-significant tendency for heavier 10 year olds to have higher systolic blood pressures

(3) A randomised controlled trial of a new therapy for hypertension shows a statistically significant difference (p=0.0012). The group who received the new therapy tended to have lower BMIs and this may have contributed to the difference observed in blood pressure at the end of the study. The appropriate interpretation/action would be:

(a) All those with BMI > 30 should be excluded and the analysis re-done

(b) The new therapy is significantly better, no further action need be taken, it should be introduced as standard

(c) Since the trial is randomised, the result must be correct and differences in BMI can be ignored

(d) The data should be re-analysed for low and high BMI

(e) BMI may be a confounder, it should be corrected for in the analysis.

(4) Children with Otitis Media are randomised to either a long or short course of antibiotics. The numbers who have recurrent attacks within the following 12 months are compared. The appropriate statistical test to make this comparison is:

(a) Analysis of variance

(b) Chi-square

(c) Student's t-test

(d) Mann-Whitney U test

(e) Regression analysis

(5) Blood pressure is measured in a randomly selected group of 100 teenagers. The median blood pressure is 68mmHg, mean 70mmHg, standard deviation 7 mmHg. This information implies that the blood pressures:

(a) Are normally distributed; approximately 95% lie within the range (56, 84 mmHg)

(b) Are non-normally distributed; approximately 95% lie within the range (56, 84 mmHg)

(c) Are normally distributed; approximately 95% lie within the range (68.6, 71.4 mmHg)

(d) Are non-normally distributed; approximately 95% lie within the range (68.6, 71.4 mmHg)

(e) Were not accurately measured.

(6) In children with renal failure, a study shows that vitamin D levels are found to be severely depleted (p<0.0001). Which of the following would be the most appropriate course of action based on this study?

(a) Introduce vitamin D supplementation as standard practice

(b) Need to consider extent of depletion and costs of supplementation, also clinical implications, make decision based on these

(c) Re-analyse the data taking into account the ages of the children

(d) Carry out a further study of greater size

(e) Nothing

(7) Cirrhotic children between the ages of 6 and 10 years are randomised to a new dietary regime or standard advice. After 2 years their height SD-scores are compared. The group allocated to the new diet have a higher mean SD-score for height (difference = 0.2, 95% confidence interval (-0.8, 1.2)) but this difference is non-significant (p=0.52). An improvement of 0.2 SD scores over a 2 year period would be considered clinically important in this group of children. The appropriate course of action based on this study would be:

(a) Do nothing further, the study has shown the new diet is not statistically significantly better than current practice

(b) Re-analyse the data using non-parametric methods

(c) Follow the children for longer to try and obtain statistical as well as clinical significance

(d) Carry out a trial of a larger size to obtain a more precise estimate of the effect of the new diet compared to standard

(e) Introduce the new diet as standard practice; the average improvement is clinically important

(8) Haemoglobin measurements were made in small groups of children with 5 different syndromes. In order to assess whether there are differences between the groups that are unlikely to have occurred by chance, which of the following should be done?

(a) A further study of much larger size

(b) Analysis of variance comparing means between the groups

(c) The data plotted according to the syndromic group

(d) Mann-Whitney U-tests between each pair of the syndromic groups

(e) Non-parametric analysis of variance comparing medians between groups (Kruskal-Wallis)

(9) Concurrent control groups are useful when performing studies because:

(a) They allow the use of statistical tests for the comparison of two groups (e.g., two sample t-tests)

(b) They help to ensure that any differences seen are due to the treatment or disease being studied

(c) They allow the study to be blinded

(d) They help boost the overall numbers studied

(e) They are better than historical controls

Best of Five Solutions

(1) One hundred children with fragile X have their heights measured, and heights are expressed as SD-scores (i.e., corrected for age). The mean score is -0.8 and standard deviation 0.15. On the basis of this sample, it can be concluded that the population average SD score for fragile X children lies with 95% confidence within the interval:

(a) (-0.5, 1.1)

(b) -0.8

(c) (-0.496, 1.087)

(d) (-0.83, -0.77) (TRUE)

(e) (-0.845, -0.765)

The 95% confidence interval is constructed approximately from the mean ± 2 standard errors. The standard error can be calculated by dividing the standard deviation by the square root of the sample size. Hence, in this question, standard error = 0.15/sqrt(100) = 0.15/10 = 0.015; and the 95% confidence interval = (-0.8 ± (2 x 0.015)) = (-0.8 ± 0.03) = (-0.83, -0.77); therefore, D is correct.

Least correct is B (only 1 potential confidence limit is given).

Nearest to correct is A, where the 95% range (mean ± 2sd) is given rather than the 95% confidence interval.

There is no basis for choosing C or E although these might look appealing because of the degree of precision.

(2) A positive correlation is found between weight and systolic blood pressure amongst a group of 10 year old children (r=0.7, 95% CI (0.52, 0.88), p=0.0004). It can be concluded that:

(a) Before weight loss can occur, systolic blood pressure must be lowered

(b) Those with high systolic blood pressures are more likely to be obese

(c) There is a significant tendency for heavier 10 year olds to have higher systolic blood pressures (TRUE)

(d) Weight affects systolic blood pressure

(e) There is a non-significant tendency for heavier 10 year olds to have higher systolic blood pressures

The positive correlation indicates that there is a positive association between weight and systolic blood pressure (heavier people tend to have higher blood pressures). The p-value shows that the association is significant. (C is correct).

B may be correct but there is insufficient information given to conclude this. An intervention (rather than observational) study would be required to deduce whether A is correct. Since the study is observational, a causal relationship cannot be inferred (D is incorrect).

E is the least correct as the p-value has been completely misinterpreted.

(3) A randomised controlled trial of a new therapy for hypertension shows a statistically significant difference (p=0.0012). The group who received the new therapy tended to have lower BMIs and this may have contributed to the difference observed in blood pressure at the end of the study. The appropriate interpretation/action would be:

(a) All those with BMI > 30 should be excluded and the analysis re-done

(b) The new therapy is significantly better, no further action need be taken, it should be introduced as standard

(c) Since the trial is randomised, the result must be correct and differences in BMI can be ignored

(d) The data should be re-analysed for low and high BMI

(e) BMI may be a confounder, it should be corrected for in the analysis. (TRUE)

D is the next nearest correct, but the data do not need to be dichotomised into low and high BMI in order to investigate whether BMI is a confounder. Dichotomising will lose information; it is preferable to leave BMI as a continuum.

Similarly A may help in the interpretation but we do not know how many are affected and whether this action will remove the problem. If we were sure that BMI was not associated with blood pressure, then B would be correct.

C is incorrect since differences in the BMI distribution are known to have occurred by chance despite randomisation.

(4) Children with Otitis Media are randomised to either a long or short course of antibiotics. The numbers who have recurrent attacks within the following 12 months are compared. The appropriate statistical test to make this comparison is:

(a) Analysis of variance

(b) Chi-square (TRUE)

(c) Student's t-test

(d) Mann-Whitney U test

(e) Regression analysis

There are two groups to compare (long/short course) hence A is the least correct (analysis of variance is appropriate when there are more than two groups).

Since a single outcome (recurrent attacks) is to be compared between the groups, regression analysis is inappropriate (answer E) and this is the next least correct.

The outcome (recurrent attacks: yes/no) is binary and hence C and D which both apply to numeric outcomes are incorrect.

The correct answer is B; the proportions with recurrent attacks would be compared between those given long and short courses of antibiotics using chi-square.

(5) Blood pressure is measured in a randomly selected group of 100 teenagers. The median blood pressure is 68mmHg, mean 70mmHg, standard deviation 7 mmHg. This information implies that the blood pressures:

(a) Are normally distributed; approximately 95% lie within the range (56, 84 mmHg) (TRUE)

(b) Are non-normally distributed; approximately 95% lie within the range (56, 84 mmHg)

(c) Are normally distributed; approximately 95% lie within the range (68.6, 71.4 mmHg)

(d) Are non-normally distributed; approximately 95% lie within the range (68.6, 71.4 mmHg)

(e) Were not accurately measured.

There is no evidence from the information given that blood pressures were measured inaccurately (E is the least correct).

Since the mean and median are approximately equal, this implies that the blood pressures are normally distributed (B and D are incorrect).

Approximately 95% will lie within the range mean ± 2 standard deviations (= 70 ± 2(7) = 70 ± 14 = 56, 84). The approximate 95% confidence interval is given by mean ± 2 standard errors. Standard error = standard deviation/sqrt(sample size) = 7/sqrt(100) = 7/10 = 0.7 and the approximate 95% confidence interval = 70 ± 2(0.7) = 68.6, 71.4. Hence D is more incorrect than B, C is incorrect and A is correct.
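The two intervals that distinguish the answer options can be computed side by side, as in this minimal Python sketch. The helper name `plus_minus_two` is illustrative, and the multiplier of 2 is the chapter's approximation.

```python
import math

def plus_minus_two(centre, unit):
    """Approximate 95% interval: centre +/- 2 units."""
    return (centre - 2 * unit, centre + 2 * unit)

# Values from question 5
mean_bp, sd_bp, n_teenagers = 70, 7, 100

# Range expected to contain about 95% of individual blood pressures (uses the SD)
range_95 = plus_minus_two(mean_bp, sd_bp)

# Interval expected to contain the population MEAN (uses the standard error)
standard_error = sd_bp / math.sqrt(n_teenagers)
confidence_interval = plus_minus_two(mean_bp, standard_error)

print(range_95)
print(tuple(round(limit, 1) for limit in confidence_interval))
```

Options A and B quote the first interval, which describes individuals; options C and D quote the second, which describes the precision of the mean and so cannot be the range within which 95% of the teenagers lie.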

(6) In children with renal failure, a study shows that vitamin D levels are found to be severely depleted (p<0.0001). Which of the following would be the most appropriate course of action based on this study?

(a) Introduce vitamin D supplementation as standard practice

(b) Need to consider extent of depletion and costs of supplementation, also clinical implications, make decision based on these (TRUE)

(c) Re-analyse the data taking into account the ages of the children

(d) Carry out a further study of greater size

(e) Nothing

There is a statistically significant difference as shown by the p-value. We are not told who the renal failure children are compared with to get that value (e.g., concurrent healthy controls, established reference range, etc.). How we interpret the results will depend on who the comparison was made with. The difference seen is statistically significant, but this may or may not be associated with a clinically important difference, although the fact that the question states that values are 'severely' depleted suggests a clinically relevant reduction has occurred. There may or may not be other factors, such as age, that need to be taken into account when interpreting the results. The study was presumably undertaken to answer some research question, the answer to which would inform clinical practice. Therefore we would not expect to do nothing after obtaining the study results. On the other hand, we do not want to introduce (or even trial) supplementation without first considering the clinical relevance and implications of the reduction found.

The p-value shows that the study is large enough that the observed difference cannot be attributed to chance. Hence, the most correct answer is B.

(7) Cirrhotic children between the ages of 6 and 10 years are randomised to a new dietary regime or standard advice. After 2 years their height SD-scores are compared. The group allocated to the new diet have a higher mean SD-score for height (difference = 0.2, 95% confidence interval (-0.8, 1.2)) but this difference is non-significant (p=0.52). An improvement of 0.2 SD scores over a 2 year period would be considered clinically important in this group of children. The appropriate course of action based on this study would be:

(a) Do nothing further, the study has shown the new diet is not statistically significantly better than current practice

(b) Re-analyse the data using non-parametric methods

(c) Follow the children for longer to try and obtain statistical as well as clinical significance

(d) Carry out a trial of a larger size to obtain a more precise estimate of the effect of the new diet compared to standard (TRUE)

(e) Introduce the new diet as standard practice; the average improvement is clinically important

The average improvement seen is clinically relevant so we would not just want to discount the information because it is statistically non-significant. The confidence interval for the difference is wide and shows that the data is compatible with the new dietary regime having no, or an adverse, effect on height and also with clinically relevant improvements (up to 1.2 SD scores). Since the diet could be associated with a detrimental or zero effect based on the study results, it would not be reasonable to introduce it as standard purely because the average effect is good. The children could be followed for longer to see whether the effect becomes larger and statistically significant but this would not answer the question of whether an improvement can be seen over 2 years. SD scores are usually normally distributed so it is unlikely, although not impossible, that non-parametric methods would be needed.

The normality of the scores should have been verified prior to parametric testing. A larger trial would enable a more precise estimate of the effect of the new diet over a 2 year period to be obtained and this would be the best course of action (D is the most correct answer).
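The reasoning above compares the confidence interval with two reference points: zero (no effect) and the smallest difference regarded as clinically important. A rough Python sketch of that comparison follows. The function `interpret_difference` and its wording are hypothetical, and it assumes that positive differences favour the new treatment and that the important difference is positive.

```python
def interpret_difference(lower, upper, important_difference):
    """Rough reading of a confidence interval for a treatment difference."""
    if lower <= 0 and upper >= important_difference:
        # Compatible with no (or adverse) effect AND with a clinically important benefit
        return "inconclusive: a more precise estimate (larger study) is needed"
    if upper < important_difference:
        return "any benefit is smaller than the clinically important difference"
    if lower >= important_difference:
        return "clinically important benefit"
    return "benefit shown, but its clinical importance is uncertain"

# Values from question 7: 95% CI (-0.8, 1.2), important improvement 0.2 SD scores
conclusion = interpret_difference(-0.8, 1.2, 0.2)
print(conclusion)
```

For the interval in this question the function returns the "inconclusive" reading, which corresponds to answer D. Such a rule is only an aid to interpretation and does not replace clinical judgement.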

(8) Haemoglobin measurements were made in small groups of children with 5 different syndromes. In order to assess whether there are differences between the groups that are unlikely to have occurred by chance, which of the following should be done?

(a) A further study of much larger size

(b) Analysis of variance comparing means between the groups

(c) The data plotted according to the syndromic group

(d) Mann-Whitney U-tests between each pair of the syndromic groups

(e) Non-parametric analysis of variance comparing medians between groups (Kruskal-Wallis) (TRUE)

The groups are small so it is likely that parametric methods are not appropriate. Before embarking on a formal analysis of the differences, the haemoglobin measurements should be plotted according to syndrome group. This plot will allow some assessment of the normality of the measurements. Testing between pairs of groups will enable significant differences to be identified but does rely on multiple tests and the p-values obtained will not be valid without adjustment. It would be preferable to perform one overall test of the significance of the differences observed between groups. It may be that the study is not large enough to identify differences of clinical importance and a larger sample is required. This will become apparent from the plot, significance test, and confidence intervals. Therefore, the most correct answer is E.

(9) Concurrent control groups are useful when performing studies because:

(a) They allow the use of statistical tests for the comparison of two groups (e.g., two sample t-tests)

(b) They help to ensure that any differences seen are due to the treatment or disease being studied (TRUE)

(c) They allow the study to be blinded

(d) They help boost the overall numbers studied

(e) They are better than historical controls

If there is no control group then it will not be possible to say whether any effects/outcomes seen in the diseased or treated group are due to the disease or treatment; hence, a control group is necessary. If a historical control group (i.e., a group previously measured/assessed) is used then we cannot be sure that any difference is not due to factors that have changed over time (e.g., improvement in diet or clinical care). Concurrent controls are therefore preferable. Using concurrent controls will remove some of the potential confounders. We want the controls to be similar to the treatment/disease group so that any differences observed are more likely to be causally attributed to the treatment/disease. If the groups are blind to treatment then treatment knowledge does not differ between groups and so this is a similarity that we want to have (where ethically and feasibly possible). The most correct answer is B.

Extended Matching Questions

In these questions a series of up to 10 'items' are given, followed by a 'lead-in' and then a series of 'stems'. For each of the stems, you should choose the correct item from the list given the information in the lead-in. Each item may be used more than once.

(1) Items:

(i) Two sample t-test, (ii) Paired t-test, (iii) Mann-Whitney U test, (iv) One-way analysis of variance, (v) Kruskal-Wallis analysis of variance, (vi) Regression analysis, (vii) Correlation coefficient, (viii) Chi-square.

For each of the following study scenarios choose the most appropriate statistical test from the list above to analyse the data

Stems:

(a) Blood pressure measurements are made in a group of children with pituitary hormone disorders and age and sex-matched control pairs. The study aims to investigate whether pituitary hormone disorders are associated with altered blood pressure.

(b) Developmental tests are applied to determine whether children who were admitted to intensive care in the neonatal period are more likely to have delayed development at age 5 than those who were not.

(c) Blood pressures (assumed normally distributed) are compared between 5 year olds from 4 different clearly defined racial backgrounds

(2) Items:

(i) The difference is statistically and clinically significant. The new cream should not be introduced.

(ii) The difference is statistically significant but difference is clinically small. The new cream should be introduced.

(iii) A larger study is required to determine whether it is worth introducing the new cream

(iv) The difference between the creams is both statistically and clinically significant. The new cream should be introduced.

(v) The difference is statistically significant but the difference is not of clinical importance and the new cream should not be introduced.

(vi) The study provides enough evidence to discount the usefulness of the new cream

(vii) The study is invalidated by the drop-outs

(viii) The results cannot be interpreted because the analysis used was inappropriate.

A randomised controlled trial is used to compare the effectiveness of a new cream (T) for treating eczema compared to the current alternative cream (C) for children between 5-10 years of age. Severity is rated on a 0 (no rash) to 10 (severe rash) scale. An average fall of 2 points on the severity scale attributable to the new cream would be deemed of clinical importance and worth changing to the new cream to achieve.

For the following study results choose the most appropriate interpretation from the list above.

Stems:

(a) Those allocated to the new cream have an average rating of 5.4 compared with 7.8 for those on the current alternative (95% confidence interval for the difference (-3.6, -1.2), p< 0.0005).

(b) Those allocated to the new cream have an average rating of 5.4 compared with 7.8 for those on the current alternative (95% confidence interval for the difference (-6.0, 1.2), p=0.23).

(c) Of the 40 children allocated to each of the two creams, 30 who used the new cream had an average severity rating of 5.4. The 50 children who used the current treatment (40 randomised to this treatment plus the 10 who did not use the new cream but reverted to current) had an average rating of 7.8. The 95% confidence interval for the mean fall in severity rating (-2.4) was (-3.6, -1.2), p< 0.0005.

Extended Matching Solutions

(1) Items:

(i) Two sample t-test, (ii) Paired t-test, (iii) Mann-Whitney U test, (iv) One-way analysis of variance, (v) Kruskal-Wallis analysis of variance, (vi) Regression analysis, (vii) Correlation coefficient, (viii) Chi-square.

For each of the following study scenarios choose the most appropriate statistical test from the list above to analyse the data

Stems:

(a) Blood pressure measurements are made in a group of children with pituitary hormone disorders and age and sex-matched control pairs. The study aims to investigate whether pituitary hormone disorders are associated with altered blood pressure.

There are two groups of children (those with pituitary hormone disorders and their age-sex matched control pairs). Hence a two sample test for comparison between groups is appropriate (i, ii, iii or viii).

Outcomes are continuous numeric (blood pressure) and it is the within pair difference that will be analysed (blood pressure for child with disorder minus blood pressure for age and sex matched pair). Hence the test must be appropriate for continuous outcome data (i.e., not viii). Since it is within pair differences that are to be analysed, these are likely to be normally distributed. The appropriate test to use is the paired t-test (ii).

(b) Developmental tests are applied to determine whether children who were admitted to intensive care in the neonatal period are more likely to have delayed development at age 5 than those who were not.

There are two groups of children (those admitted to intensive care and those not). Hence a two sample test for comparison between groups is appropriate (i, ii, iii or viii). Outcome is binary (i.e., categorical with two categories; developmentally delayed: yes/no).

The proportion with developmental delay is to be compared between those admitted to intensive care or not. The appropriate test for comparing proportions between two groups is chi-square (viii is correct).

(c) Blood pressures (assumed normally distributed) are compared between 5 year olds from 4 different clearly defined racial backgrounds.

There are 4 groups of children to be compared (from different racial backgrounds). Hence a test for simultaneous comparison between more than two groups is appropriate (iv, v or viii). The outcome (blood pressure) is continuous (hence viii is not appropriate) and normally distributed and hence parametric testing should be used (iv is correct).
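The reasoning used in the three solutions above (number of groups, type of outcome, pairing, normality) can be sketched as a small decision function. This is only an illustrative sketch: the function name `choose_test` and its arguments are hypothetical, and it covers only the comparison tests in the item list, not regression or correlation.

```python
def choose_test(n_groups: int, outcome: str, paired: bool = False,
                normal: bool = True) -> str:
    """Pick a comparison test from the item list (hypothetical helper).

    outcome is "continuous" or "categorical"; paired and normal describe
    the design and the distribution of the continuous outcome.
    """
    if outcome == "categorical":
        # Proportions compared between groups.
        return "Chi-square"
    if outcome != "continuous":
        raise ValueError("outcome must be 'continuous' or 'categorical'")
    if n_groups == 2:
        if paired:
            if not normal:
                # No paired non-parametric test appears in the item list.
                raise ValueError("no suitable test in the item list")
            return "Paired t-test"
        return "Two sample t-test" if normal else "Mann-Whitney U test"
    if n_groups > 2:
        if normal:
            return "One-way analysis of variance"
        return "Kruskal-Wallis analysis of variance"
    raise ValueError("at least two groups are needed for a comparison")


# The three stems from this question.
stem_a = choose_test(2, "continuous", paired=True)   # matched pairs
stem_b = choose_test(2, "categorical")               # delayed: yes/no
stem_c = choose_test(4, "continuous", normal=True)   # four groups
print(stem_a, stem_b, stem_c, sep="\n")
```

The function simply encodes the order of questions asked in the solutions; in practice the normality assumption would be checked from a plot of the data rather than passed in as a flag.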

(2) Items:

(i) The difference is statistically and clinically significant. The new cream should not be introduced.

(ii) The difference is statistically significant but difference is clinically small. The new cream should be introduced.

(iii) A larger study is required to determine whether it is worth introducing the new cream

(iv) The difference between the creams is both statistically and clinically significant. The new cream should be introduced.

(v) The difference is statistically significant but the difference is not of clinical importance and the new cream should not be introduced.

(vi) The study provides enough evidence to discount the usefulness of the new cream

(vii) The study is invalidated by the drop-outs

(viii) The results cannot be interpreted because the analysis used was inappropriate.

A randomised controlled trial is used to compare the effectiveness of a new cream (T) for treating eczema compared to the current alternative cream (C) for children between 5-10 years of age. Severity is rated on a 0 (no rash) to 10 (severe rash) scale. An average fall of 2 points on the severity scale attributable to the new cream would be deemed of clinical importance and worth changing to the new cream to achieve.

For the following study results choose the most appropriate interpretation from the list above.

For all stems, (i) and (ii) cannot be correct because each contradicts itself. If the difference in favour of the new cream is both statistically and clinically significant, then the difference observed is unlikely to be due to chance and is also large enough to be clinically relevant. A clinically important difference is one that is large enough to justify the change after taking into account all other factors (e.g., cost, ease of use, problems associated with introducing a new treatment). Hence in that scenario the new cream should be introduced, which rules out (i). Conversely, if the difference is clinically small there is no case for introducing the new cream, which rules out (ii).

Stems:

(a) Those allocated to the new cream have an average rating of 5.4 compared with 7.8 for those on the current alternative (95% confidence interval for the difference (-3.6, -1.2), p< 0.0005).

The difference is statistically significant since the p-value (<0.0005) is small. The average fall of 2.4 points on the severity scale is larger than deemed enough to be of clinical importance. However, the confidence interval shows that the data is compatible with a difference of between 1.2 and 3.6 points. An average fall of 1.2 would not be deemed clinically important enough to warrant changing to the new cream. So the data is compatible with outcomes that are synonymous with differing courses of action. For a fall of 1.2 to less than 2 points on average, the cream would not be introduced, whereas an average fall between 2 and 3.6 would lead to introduction of the cream. A larger study needs to be done to reduce the width of the confidence interval and gain a more precise estimate of the value of the new cream (iii is correct).

(b) Those allocated to the new cream have an average rating of 5.4 compared with 7.8 for those on the current alternative (95% confidence interval for the difference (-6.0, 1.2), p=0.23).

The difference is statistically non-significant since the p-value (0.23) is not small (ii, iv and v cannot be correct). The confidence interval shows that the sample data is compatible with an average change of anywhere between a 6 point drop in favour of the new cream and it making the rash 1.2 points worse on average. Hence we cannot discount scenarios that would lead to introduction of the new cream (i.e., a fall of between 2 and 6 points on average on the severity scale). Neither can we discount the possibility that the new cream does not have a clinically important effect (the difference may be less than 2 points, and the p-value and confidence interval both show the data is compatible with no difference between creams). Hence a larger study needs to be done to distinguish between differences that are clinically relevant and those that are not (iii is correct).

(c) Of the 40 children allocated to each of the two creams, 30 who used the new cream had an average severity rating of 5.4. The 50 children who used the current treatment (40 randomised to this treatment plus the 10 who did not use the new cream but reverted to current) had an average rating of 7.8. The 95% confidence interval for the mean fall in severity rating (-2.4) was (-3.6, -1.2), p< 0.0005.

The difference, confidence interval and p-value are the same as in (a). The observed difference is statistically significant but may or may not be of a clinically relevant magnitude. However, the children allocated to the new cream but not using it have been combined with those in the other allocation group. Hence the groups compared are no longer those formed by randomisation. Those that changed treatment from their allocation may differ in some way that biases the results. The data should have been analysed on an intention-to-treat basis (i.e., outcomes compared according to allocated group rather than according to the treatment actually used). This flaw in the analysis makes it impossible to interpret the results as we cannot assess the extent of any bias (viii is correct).
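The way a confidence interval is read against the clinically important difference in stems (a) and (b) can be sketched in code. This is a hedged illustration, not part of the original solutions: the function `interpret` is hypothetical, differences are taken as new cream minus current cream (so a negative value favours the new cream), and the mapping of intervals to items (iv), (v) and (vi) is an assumed reading of the item wording. It assumes the analysis itself is valid, so it does not cover stem (c).

```python
def interpret(ci_low: float, ci_high: float, important: float = -2.0) -> str:
    """Map a 95% CI for (new - current) to an item label (hypothetical).

    important is the clinically important difference: a fall of 2 points.
    """
    if ci_high <= important:
        # Every compatible value is a fall of at least 2 points.
        return "iv"
    if ci_low > important:
        # No compatible value reaches a clinically important fall.
        if ci_high < 0:
            return "v"   # significant benefit, but too small to matter
        return "vi"      # compatible with no benefit at all
    # The interval straddles the clinically important difference.
    return "iii"


stem_a = interpret(-3.6, -1.2)   # stem (a)
stem_b = interpret(-6.0, 1.2)    # stem (b)
print(stem_a, stem_b)
```

Both stems return the same item because both intervals include values on either side of a 2 point fall, even though only the first interval excludes zero.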

Longer Exercises

These questions tend to use real life scenarios and ask you to combine knowledge from different sections to give a comprehensive response. These questions are mostly taken from previous MSc examinations. They are in no particular order of difficulty.

(1) Define a confounder. In a study to determine whether hearing loss in early life is associated with a change in average IQ at age 10, what might be potential confounders? How could these be dealt with in the study design?

(2) For any 3 of the following give examples of studies where one might analyse data using:

(a) 2 sample t-test
(b) Single test of proportion
(c) Paired t-test
(d) One-way ANOVA
(e) Mann-Whitney U test

Note: Do not just use example studies from teaching notes; make up your own hypothetical studies e.g., for 'comparing two proportions' answer could be 'A study to compare incidence of hearing problems between a group of children who had meningitis in infancy and an unmatched group who were meningitis-free'.

(3) Give reasons why a trial of a new therapy amongst children with severe hearing loss might need to be:

(a) Randomised
(b) Blinded
(c) Matched
(d) Placebo controlled

(4) A study was undertaken to assess a new treatment for acute OME. Pairs of patients were matched for age and sex. The first of each pair to attend the clinic was given the new treatment. The speech scores at age 5 of those that completed the treatment were compared with those who received no treatment or who did not complete the course of treatment. A two sample t-test was used (p=0.049). It is concluded that the new treatment is effective and should become the standard treatment. Comment on the study design, the results and the conclusion presented. Outline any changes that you would make in the study design.

(5) Write an outline protocol for a study to evaluate the effectiveness of hearing aids in the management of children with cleft palate.

(6) A researcher wants to understand the problems associated with patients attending scheduled hospital appointments. Initially he wants to ask some open questions (for example, 'Did you have any problems with attending today?', 'What were they?'). He decides to call into the waiting room several times during the day and select a waiting patient to complete the questions (which should take about 30 minutes in total). Will his sample be representative of patients with scheduled appointments? Discuss the choice of sample and any potential alternatives (give their pros and cons too).

(7) Response to a treatment is recorded as positive or negative. The treated group is compared to a group given no treatment. What outcome will be compared between groups? What will the null hypothesis be? Outline briefly what statistical analyses should be done and how the results will be interpreted. Explain the influence that the selection of patients for treatment or not and their subsequent treatment regime may have on interpretation of results.

(8) Briefly explain 3 of the following terms:

(a) Confidence interval
(b) p-value
(c) Significance test
(d) Power of study

(9) A study aims to investigate the effect of hearing loss in early life on IQ at age 12. Children are recruited from two local schools: one for deaf children and the other for children with special needs and/or other disabilities. Pairs of sex and social class matched children are chosen from the schools and their IQs assessed. On average the deaf children have IQs 20 points lower (95% confidence interval (-2, 42), p=0.13, two-sample t-test).

Does this study support the hypothesis that hearing loss affects later IQ? Comment on the design of the study and analyses undertaken. Outline any changes that you would make to a subsequent study design.

(10) A new therapy is thought to cause increased dizziness. Write an outline protocol for a study designed to determine whether the prevalence increases from 10% amongst those on standard therapy to 30% for those given the new treatment.

(11) The quality of life, self assessed as a continuous measurement within the range 0-10, is compared between individuals with and without tinnitus. What would be an appropriate statistical test for assessing the difference in average quality of life between the groups? It is thought that age and sex may influence the measurements. What is the appropriate analysis for determining whether quality of life differs in tinnitus patients after accounting for age and sex differences between the groups?

(12) Gender related differences in hearing threshold are shown in Table 1 below.

Ref: Erlandsson & Hallberg "Prediction of quality of life in patients with tinnitus" Br J Audiol 2000; 34: 11-20.

Comment on the presentation of this data. Suggest a suitable graphical means of display.

(13) A study was undertaken to try and ascertain the developmental consequences of multiple ear infections in infancy.

The parents of children in year 3 (7-8 year olds) of 12 local primary schools are sent leaflets asking for their participation if their child had at least 3 bouts of ear infection requiring antibiotics in the first 2 years of life.

Twenty three parents agreed that their child could participate and these children were given performance and verbal IQ tests.

The average performance IQ score of the 23 children was 93 (87, 99), p=0.049, significantly lower than the expected average of 100 (results are given as average (95% confidence interval)). The average score for verbal IQ was not significantly different to 100 (average = 94 (88, 100), p=0.055).

The authors conclude that multiple ear infections in infancy lead to impaired performance by 7 years of age, but that verbal reasoning is unaffected.

Comment on the study and the authors' conclusions. How could the study design be improved?

(14) Objective: To investigate the effect of increased levels of background noise on click-evoked otoacoustic emission (CEOAE) recordings and to compare the effectiveness of the default CEOAE program with the QuickScreen CEOAE program in increased levels of noise, using an Otodynamics ILO88 recording device.

Design: The right ears of 40 young adult women with normal hearing were assessed using CEOAEs under four different noise conditions and with two different methods of data collection. The noise conditions were in quiet, 50 dBA, 55 dBA, and 60 dBA of white noise.

Data were collected at each noise level in the default mode and also using the ILO88 QuickScreen program.

Make comments about the study design and its ability to address the stated objective. Suggest how the data may be analysed to answer these objectives.

Ref: K. Rhoades, B. McPherson, V. Smyth, J. Kei and A. Baglioni "Effects of Background Noise on Click-Evoked Otoacoustic Emissions" Ear and Hearing 1998; 19: 450-462.

(15) Ref: The Ketogenic diet: A 3- to 6-Year Follow-Up of 150 children enrolled prospectively. C Hemingway, JM Freeman, DJ Pillas, PL Pyzik. Pediatrics 2001; 108; 898-905.

(i) You have been asked to advise on the long term outcome for a child in your practice who suffered a haemorrhagic stroke 3 months previously. The child is just under 6 years of age and was previously regarded as healthy. Critically review the attached recent article on this topic and state what advice you could give the parents of the child on the basis of its contents.

(ii) Does the article give any evidence that long-term survivors of haemorrhagic stroke in childhood have low self esteem in later life?

(iii) Design a study to test the effect of cognitive therapy for improving long term outcomes in this patient group.

(16) The parents of a 5 year old child with moderate epilepsy approach you with the attached article they have read in The Daily Mail (2nd October 2001: How fats can end fits) and demand that you give them details of how to implement a diet rich in fatty foods that will cure their child and remove the need for any other drugs.

You manage to obtain a copy of the research paper from Pediatrics that the newspaper article is based on.

Ref: The Ketogenic diet: A 3- to 6-Year Follow-Up of 150 children enrolled prospectively. C Hemingway, JM Freeman, DJ Pillas, PL Pyzik. Pediatrics 2001; 108; 898-905.

(a) Critically review the two articles and comment on the applicability of any findings to the child in question.

(b) What advice would you give to the parents?

(c) Write an outline protocol for a study to estimate the effects of switching to a ketogenic diet on frequency of fits amongst pre and infant schoolchildren with moderate epilepsy

(17) Ref: Worawattanakul M, Rhoads JM, Lichtman SN, Ulshen MH. Abdominal migraine: Prophylactic treatment and follow-up. Journal of Pediatric Gastroenterology and Nutrition, 28; 27-40; 1999.

You are responsible for the treatment of children who present to your paediatric clinic and are subsequently diagnosed as having abdominal migraine. There are between 1 and 3 such children per year with varying degrees of severity. Your current policy is to intervene only if symptoms are severe and very disruptive to normal daily activity, and your favoured intervention in this situation is to prescribe pizotifen.

However this treatment is not uniformly effective and the article published in the Journal of Pediatric Gastroenterology and Nutrition catches your attention.

Critically evaluate the study presented. Would it help to change your policy with respect to treatment of childhood abdominal migraine?

Comment on the relative merits of grading response as excellent/fair/poor or number of pain attacks in 6 months.

On the basis of the evidence presented in this paper, what advice would you give someone being prescribed cyproheptadine regarding side effects?

Briefly outline any further research that may be justified.

Longer Exercise Solutions

(1) A confounder is something that 'gets in the way' of a comparison. These are factors which are not of direct interest but vary between the groups of subjects we are interested in and also are associated with the outcome.

For example:

Suppose we are interested in associations between alcohol consumption in pregnancy and reading ability at age 10. Parental social class will probably be associated with alcohol consumption level (groups of subjects with different alcohol consumption rates will vary in their social class distribution) and also influence the child's reading ability.

Therefore social class is a confounder: having observed a difference in reading abilities according to alcohol consumption, we would not know whether this difference was due to alcohol consumption or social class.

Potential confounders for the study in question 1 might be social class, age, sex, ethnicity, etc.

These could be dealt with via the study design by taking age, sex, social class and ethnicity matched pairs of normal and hearing deficient children. By doing this we ensure that the groups (normal/hearing deficient) do not differ on these factors and so they cannot be confounders.

(2) The students should have chosen examples of the following:

(a) A study where the outcome is a continuous normally distributed variable to be compared between two unmatched (or independent) groups

(b) A study of the prevalence of some disorder in a group of individuals

(c) A study where individuals are matched (either a within matched pair randomisation or a disease group with matched controls) and the outcome is continuous (not necessarily normally distributed since within pair differences may be).

(d) A study with a continuous normally distributed outcome that is to be compared between more than two unmatched groups (for example, 4 different ethnic groups or 5 groups following different treatment protocols).

(e) A study with an ordinal or numeric outcome to be compared between two unmatched groups. If the outcome is continuous numeric, then the Mann-Whitney U test is appropriate when the outcome is not normally distributed and cannot be transformed to be so, when it is normally distributed but with very different variances in the 2 groups, and/or when the sample size is small. Otherwise, a two sample t-test may be more appropriate.

(3) (a) Randomisation ensures that allocation to groups is not biased by individual features. For example, with a new therapy if we do not randomise to groups but allocate according to personal and physician preference we may find that the 'new therapy' group contains sicker patients (or conversely less ill patients if they are the ones thought able to withstand a treatment change) than the group receiving standard treatment.

(b) If patients and assessors/clinicians are blinded to allocation then any difference found in the treatment groups cannot be due to a subconscious belief that someone is on the better or worse treatment.

(c) Matching prior to randomisation will ensure that the treatment groups are the same with respect to the matched factors. The factors on which matching takes place cannot then be confounders. For example, if age and sex matched pairs of individuals are randomly allocated to the new and standard treatments, then the groups will contain the same numbers of each age and sex and any differences in outcome cannot be due to age or sex differences.

(d) If the new therapy is to be tested against 'nothing' (i.e., not against some standard or existing treatment), then it is important that a placebo is used if possible. This is to ensure that the treated group is not biased and does not perform better because of some psychological belief arising from the fact that they have received some treatment (e.g., a pill) rather than because of the actual treatment.

(4) The allocation to treatment is systematic (the first of each pair to attend receives the new treatment). This could introduce a bias. Randomisation within pairs should have been used.

The new treatment should be compared with standard treatment (which it may ultimately replace) rather than no treatment.

Age and sex matched pairs are recruited. This pairing should be retained in the analysis, hence a paired t-test should be used (provided the paired differences are approximately normally distributed or can be transformed to be so).

Patients allocated treatment should be compared with those not allocated treatment regardless of whether or not they completed the course of treatment, i.e., results should be analysed on an intention-to-treat basis.

There is no mention of the sample size. Interpreting the p-value alone is difficult. A confidence interval for the difference in outcome would be much more informative. Statistical significance is not the same as clinical significance.

You could ask whether pairing is the best option; does this lead to some loss of subjects (i.e., those with no pairs)? It might be best to recruit and randomise individuals, then adjust for age and sex in the analysis if these are likely to be confounders (alternatively do a stratified randomisation or use minimisation to ensure groups of similar age and sex).

Are the speech scores that are used known to be valid? Is it possible to blind the study?

(5) Need to detail:

Where will subjects be recruited from?

Are there any exclusion criteria?

How will bias be ascertained between those who agree to take part and those who don't?

Is there a placebo control or is comparison with current treatment?

Will the study be blinded?

If so, how?

What mechanism will be used to determine who receives treatment?

How is 'effectiveness' to be measured?

Are there any confounders?

How will these be dealt with?

What is the protocol for dealing with non-compliers?

What is the basis for deciding on sample size?

What tests, confidence intervals, displays, etc., will be used?

What are the implications of the possible results for the future use of hearing aids?

(6) Should mention:

No mention of how patients selected from waiting room. Could the sample be biased towards those who arrive early (and hence have time/had no problems with travel and are not arriving late) and/or appear more approachable (maybe certain social classes/sexes/those without other children etc.)?

Some record should be made of those approached but refusing to take part. Are those who agree to participate more likely to have grievances to air?

What about patients with scheduled appointments who do not attend? These will be excluded within current design but obviously form an important group to ask regarding ease of attendance.

Study is to do with patients attending scheduled appointments. It appears that one waiting room from one hospital department is to be used. How representative is this hospital of others? How representative is the ward chosen? It seems unlikely that the experiences from one ward within one hospital will represent patients attending for different needs in hospitals in general. Maybe the research question should be made more specific (e.g., patients attending for a specific reason). More hospitals should be included.

Potential alternative is to make random selection from patients with scheduled appointments from the appointment schedule.

(7) The outcome to be compared will be the proportion (or percentage) of those responding positively for each of the two groups (treated/not treated).

The null hypothesis will be that there is no difference in the proportions (or percentages) responding positively in the two groups.

The difference in proportions (or percentages) should be given with 95% confidence interval and significance test of whether difference is compatible with zero difference in the population.

A significant difference (low p-value, confidence interval does not contain zero) indicates that the data samples obtained are not compatible with treatment being ineffective.

Important points to consider in interpreting results are:

- Were patients randomised to treatment groups? If not, then could bias of some form be confounding the results? E.g., is it the more severely ill who received treatment?

- Were patients blind to their treatment group?

- Was the assessor of positivity blind to the treatment group of the patient?

- Were the results analysed on an intention-to-treat basis (i.e., according to their allocated group rather than whether they completed treatment or not)?
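The analysis described in this answer (difference in proportions, 95% confidence interval, and a test against zero difference) can be sketched as follows. The counts are hypothetical and invented purely for illustration, since the question gives no data. The sketch uses the large-sample normal approximation, which is an assumption here; with small numbers an exact method would be preferable.

```python
from math import sqrt
from statistics import NormalDist


def compare_proportions(x1: int, n1: int, x2: int, n2: int):
    """Return (difference, 95% CI low, 95% CI high, two-sided p-value)."""
    p1, p2 = x1 / n1, x2 / n2
    diff = p1 - p2
    # Standard error of the difference, used for the confidence interval.
    se = sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    z95 = NormalDist().inv_cdf(0.975)
    # Standard error under the null hypothesis of equal proportions.
    pooled = (x1 + x2) / (n1 + n2)
    se0 = sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    p_value = 2 * (1 - NormalDist().cdf(abs(diff) / se0))
    return diff, diff - z95 * se, diff + z95 * se, p_value


# Hypothetical counts: 30 of 50 treated and 20 of 50 untreated respond.
diff, low, high, p_value = compare_proportions(30, 50, 20, 50)
print(f"difference = {diff:.2f}, 95% CI ({low:.2f}, {high:.2f}), p = {p_value:.3f}")
```

With these made-up counts the interval just excludes zero and the p-value is just below 0.05, which shows how the two summaries agree with each other.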

(8) (a) The confidence interval gives the range of population scenarios that the sample data is compatible with. For example, if we have a 95% confidence interval for the difference in sample means (say between a treated and untreated group), then we are 95% confident that the difference in population means (between those treated and those not) lies somewhere within this interval. A 99% confidence interval is wider and we are more confident that it contains the population value (99% confident rather than 95% confident). The 95% confidence interval is calculated by taking the sample estimate ± 1.96 times the standard error.

(b) The p-value is the probability of obtaining sample data at least as extreme as that observed if the null hypothesis is true. As it is a probability it takes values between zero and 1. Values close to zero indicate that the sample data would be unlikely if the null hypothesis were true, and so provide evidence against it; values close to 1 indicate that the sample data is compatible with the null hypothesis being true.

(c) A significance test is used to assess the probability of obtaining the sample data if some specified null hypothesis is true. The outcome of a significance test is a p-value. The particular type of significance test that must be used for a given situation depends on how many groups are being compared and the nature of the outcome variable (continuous numeric, normally distributed or not, categorical).

(d) The power of a study is its ability to detect a difference of a given size. The power of a study is greater for larger samples. The power is usually expressed as a percentage. 0% power means that the difference would never be detected, 100% that it is always detected.
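The "estimate ± 1.96 times the standard error" rule in (a), and the link between a confidence interval and a p-value, can be illustrated with a short sketch. It uses the eczema cream result from the extended matching questions (difference -2.4, 95% confidence interval (-3.6, -1.2)). The standard error is not given in the text, so it is back-calculated from the interval's half-width, and a normal approximation is assumed throughout.

```python
from statistics import NormalDist


def confidence_interval(estimate: float, se: float, level: float = 0.95):
    """Estimate plus or minus the normal multiplier times the standard error."""
    z = NormalDist().inv_cdf(0.5 + level / 2)   # about 1.96 for 95%
    return estimate - z * se, estimate + z * se


def two_sided_p(estimate: float, se: float) -> float:
    """Two-sided p-value for the null hypothesis of zero difference."""
    z = abs(estimate / se)
    return 2 * (1 - NormalDist().cdf(z))


se = 1.2 / 1.96                      # assumed: half-width 1.2 divided by 1.96
ci95 = confidence_interval(-2.4, se)
ci99 = confidence_interval(-2.4, se, level=0.99)
p = two_sided_p(-2.4, se)
print(ci95, ci99, p)
```

The 99% interval comes out wider than the 95% interval, as described in (a), and the p-value is below the 0.0005 quoted for that stem.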

(9) Should mention/discuss:

Children from school for special needs/other disabilities may not have IQs representative of children without hearing loss.

Only two schools are used and the children in these may not be representative of children from schools of that type.

Are children from deaf schools representative of children with hearing loss anyway?

Matched pairs of children are chosen and hence a paired t-test would be appropriate rather than an unpaired two-sample test.

The result is non-significant (p=0.13) but this does not mean that there is necessarily no association with IQ at age 12. The confidence interval is wide and the data compatible with as much as a 42 point deficit in the hearing loss group.

The study should be designed to give representative samples of hearing loss children and appropriate controls. Do we want similar disability but good hearing or non-disabled children as comparisons?

No mention is made of how many children the study is based on. Some thought needs to be given to the size of difference in IQ that is clinically important and how many children need to be studied to detect this with adequate power.

(10) Should mention/discuss:

Choice of sample, where recruited from, any exclusion criteria?

Size of sample

Outcome measurements. How will dizziness be assessed?

Potential confounders

Placebo control, what form?

Blinding? If so, how?

What statistical tests will be used? What is null hypothesis? Confidence intervals

What are the implications of the possible results for the future treatment of dizziness?

(11) A two-sample t-test can be used to compare the mean quality of life assessment between the tinnitus and non-tinnitus individuals, provided the assessments are normally distributed (or can be transformed to be so).

A Mann-Whitney U-test can be used to compare the median values if the quality of life measurements are non-normal. Multiple regression would be used to compare measurements between groups (tinnitus yes/no) after accounting for the age and sex of the individuals.
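As a rough sketch of what the Mann-Whitney U-test computes, the snippet below counts how often a value in one group exceeds a value in the other and converts this to an approximate p-value. The quality of life scores are hypothetical, the function names are illustrative, and the p-value uses a simple normal approximation without tie or continuity corrections; in practice a statistics package would be used.

```python
from math import sqrt
from statistics import NormalDist

def mann_whitney_u(x, y):
    """U statistic for x: number of (x, y) pairs with x > y, ties counted as 0.5."""
    return sum(1.0 if a > b else 0.5 if a == b else 0.0 for a in x for b in y)

def approx_p_value(x, y):
    """Two-sided p-value from the normal approximation to the distribution of U."""
    n1, n2 = len(x), len(y)
    u = mann_whitney_u(x, y)
    mu = n1 * n2 / 2                             # expected U under the null hypothesis
    sigma = sqrt(n1 * n2 * (n1 + n2 + 1) / 12)   # SD of U under the null (no ties)
    z = (u - mu) / sigma
    return 2 * (1 - NormalDist().cdf(abs(z)))

# Hypothetical quality of life scores (illustrative only)
tinnitus = [42, 55, 38, 61, 47, 50, 35, 44]
no_tinnitus = [58, 66, 49, 72, 63, 70, 54, 60]

print("U =", mann_whitney_u(tinnitus, no_tinnitus))
print("approximate p =", round(approx_p_value(tinnitus, no_tinnitus), 4))
```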

(12) The outcome (pure tone hearing threshold) is a continuous measurement that can only take positive values. The results are summarised as means and standard deviations, but these are only appropriate summaries if the data are normally distributed. The relatively large standard deviations in relation to the mean values suggest that the outcome is probably skewed. For example, for the total group the worse ear mean = 36 and SD = 25; if the values were normally distributed then we would expect about 95% of the patients to have values in the range 36 ± 2(25) = 36 ± 50, i.e., from -14 to 86, which does not make sense since values cannot be negative. Hence, it would have been better to give median values and inter-quartile ranges. The medians would give a better representation of the values seen in the different groups than the means, which are probably inflated by relatively few individuals with very poor hearing.
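The mean ± 2 SD check above can be reproduced in a few lines. This is only a sketch: the mean of 36 and SD of 25 come from the worked example, while the variable names and the extra step (the share of a normal distribution that would fall below zero) are added for illustration.

```python
from statistics import NormalDist

group_mean, group_sd = 36, 25   # worse ear, total group (from the worked example)

# Range that would contain about 95% of values if the data were normal
lower = group_mean - 2 * group_sd
upper = group_mean + 2 * group_sd
print("approximate 95% range:", lower, "to", upper)

# Proportion of a normal distribution with this mean and SD lying below zero
below_zero = NormalDist(mu=group_mean, sigma=group_sd).cdf(0)
print(f"implied proportion of negative values: {below_zero:.1%}")
```

A lower limit below zero, or a non-trivial implied proportion of negative values, signals that the normal model does not fit a measurement that can only be positive.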

Scatter plots could be used to either:

Plot worse ear against better ear values, or plot either of the pure tone values against age (i.e., 2 plots, 1 for worse ear and 1 for better ear).

Any of these plots could show gender by using different symbols for males and females.

(13) We do not know how representative the 12 chosen local primary schools are.

Only 7-8 year old children are studied so only developmental consequences (as measured by performance and verbal IQ tests) apparent at this age will be detected. There may be developmental consequences apparent at other ages and these will not be measured.

Approximately 12 × 30 = 360 children would have been sent leaflets, but only 23 agreed to participate. Since no information was collected from the other 337 children, we do not know whether they were eligible (i.e., they had at least 3 bouts of ear infection requiring antibiotics in the first 2 years of life) and had decided not to take part, or whether they were not eligible. We therefore have no idea how biased the sample studied is.

The sample may be subject to recall bias. There is no check on the validity of the parents' claim that the children were eligible.

Although the research question relates to multiple ear infections, the study only looked at those with 3 or more infections requiring antibiotics. The sample selected on this basis may not be representative of all children who have multiple infections. Some children may have had multiple episodes but not sought medical attention for them all and/or they may not have received antibiotics.

There is no concurrent control group. It is assumed that the expected average of 100 would be found if we used the same tests in children of similar background, recruited and assessed in the same way, who had not had multiple ear infections in infancy. We are not told anything about the social background of the children from the schools selected and whether they are representative of the national distribution. Furthermore, those who choose to reply may not be representative of the eligible children within those schools.

Too much emphasis has been placed on p-values in the interpretation of the results. The values of 0.049 and 0.055 are not very different, but using a cut-off of 0.05 to dichotomise results into significant/non-significant means that different conclusions are drawn from the two test results. The confidence intervals show that the sample values are compatible with a mean performance IQ of 99, which, although statistically significantly different from 100, is a drop of little clinical relevance. A mean of 87, however, with which the sample is also compatible, would be clinically important. For verbal IQ, the sample is compatible with a mean level of 88, which would not fit in with the conclusion that 'verbal reasoning is unaffected'.

A better study would prospectively assess ear infections in a random sample of babies born. The children could be assessed at suitable later stages and their development related to the frequency and severity of infancy ear infections. The assessments should take place with the assessor blind to the infection status of the children. Any potential confounders (for example ethnicity) should be adjusted for in the analyses. The study should be large enough to detect clinically important differences with suitable power.

(14) The study seems adequate to address the research question as stated. However, it would be nice to know that the order of usage of the two methods was randomised, and also the order in which the 4 noise levels were assessed (if this is practically possible to do). Otherwise we cannot be sure that there is not an order effect. It would also be informative in our interpretation of the results to know how the young adult women were chosen. Are they representative of all young adult women? Why is it only young adult women that are assessed anyway? Presumably this means that the results cannot automatically be applied to males or to women of differing ages? What age range does the sample cover?

Tables 3 and 4 give the results for two outcomes from CEOAE. Are these the only or best of the measurements that would be obtained given the tests they describe?

Looking at the means, SDs and ranges suggests that these are probably appropriate summary measures to use; i.e., the interval mean ± 2sd ties in approximately with the ranges in most cases, so no skewing is suggested.

Measurements are paired within individuals at each noise level and values could be related between women across noise levels. So, there are a lot of relationships between values within a woman that are lost in the current displays of the data. A simple(ish) analysis might consist of firstly looking at the serial measures within woman across different noise levels for each of the measurement techniques (default and QuickScreen).

The values could be plotted as lines (one per woman): there would be 4 plots; 2 outcomes for 2 measurement techniques. Secondly, the within woman differences between the values obtained from the 2 techniques could be summarised (means and SDs) for each noise level and each outcome.

Bland-Altman type plots could be used to display the results and limits of agreement calculated. More complex analyses would involve modelling the outcomes as functions of noise level, order of administration and technique whilst retaining the fact that values are grouped within woman.
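The limits of agreement mentioned above can be sketched as follows. The paired values are hypothetical readings for six women at a single noise level (not data from the paper), and the function name `limits_of_agreement` is illustrative. The limits are taken as the mean within-woman difference ± 1.96 standard deviations of the differences, which assumes the differences are roughly normally distributed.

```python
from statistics import mean, stdev

def limits_of_agreement(method_a, method_b):
    """Return (lower limit, mean difference, upper limit) for paired measurements."""
    diffs = [a - b for a, b in zip(method_a, method_b)]
    bias = mean(diffs)        # average difference between the two techniques
    sd = stdev(diffs)         # spread of the within-woman differences
    return bias - 1.96 * sd, bias, bias + 1.96 * sd

# Hypothetical paired values at one noise level (illustrative only)
default_mode = [12.1, 10.4, 14.8, 9.7, 11.5, 13.2]
quickscreen = [11.6, 10.9, 13.9, 9.1, 11.8, 12.4]

lower, bias, upper = limits_of_agreement(default_mode, quickscreen)
print(f"mean difference {bias:.2f}, limits of agreement {lower:.2f} to {upper:.2f}")

# Points for a Bland-Altman plot: pair mean (x-axis) against difference (y-axis)
points = [((a + b) / 2, a - b) for a, b in zip(default_mode, quickscreen)]
```

The same calculation would be repeated for each noise level and each outcome.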

(15) Suggested marks allocated: (a) 60%, (b) 10%, (c) 30%

Examiner notes:

Should discuss choice of outcomes, recording methods, choice of patients, and the design of the study.

Variable lengths of follow-up: is it OK to combine outcomes across such a wide range? E.g., we might expect many of the outcomes of figure 3 to be worse nearer to the time of stroke and to diminish over time.

The controls were not concurrently assessed; they were assessed by different researchers. Therefore, there may be confounding factors.

No reference for 'the difference test for large samples'. We wouldn't necessarily expect students to say this, but it's not clear what test they are referring to here.

IQs are grouped in figure 2; we are not told what cut-offs were used for the categories. Why not plot the raw values and avoid arbitrary grouping?

The results of only one comparison, between those with and without functional impairment, are given. Were other comparisons made but not presented because they were non-significant? The difference should be given with a confidence interval.

The acronym 'SF' is given in legend for figure 3 but not in figures; is SE in 'a' supposed to read SF?

The values p<0.025 and z>1.96 are presented. It is not helpful to give the z-value, although it does show that a 1-sided test was used, which is not justified. Conventionally 2-sided tests should be given unless there is strong justification not to do so.

It would be better to give individual line plots for the study group; some measure of variability should be given at least for both the control populations and the study group.

Rather than just giving stars to indicate the range of significance, the authors should give average differences and confidence intervals for these. Are differences in means of these magnitudes of clinical importance?
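The reasoning in the note on p<0.025 and z>1.96 can be checked with the standard normal distribution: a z-value of 1.96 corresponds to a tail area of about 0.025 on one side and about 0.05 on both sides, so reporting p<0.025 alongside z>1.96 implies a 1-sided test. A minimal sketch, with illustrative variable names:

```python
from statistics import NormalDist

z = 1.96
one_sided_p = 1 - NormalDist().cdf(z)   # area in the upper tail only
two_sided_p = 2 * one_sided_p           # area in both tails

print(f"1-sided p for z = {z}: {one_sided_p:.3f}")
print(f"2-sided p for z = {z}: {two_sided_p:.3f}")
```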

(16) (a) Critically review the two articles and comment on the applicability of any findings to the child in question (45% of marks).

The objective is stated to be about the subgroup of 83, but the rest of the abstract is about 150. The main 'problem' is that the study, whilst apparently encouraging, is observational. We do not know whether a large percentage of the children were enrolled at a time when they were having lots of uncontrolled seizures; if so, they may have improved anyway.

The follow up percentage is very high; few were not contacted at all.

(b) What advice would you give to the parents? (10% of marks)

Point out that there is no real proof that the diet works and that it is quite a difficult diet to implement (almost half discontinued in the first year). On the other hand, the child is within the age range. Moderate? Does this correspond to the patients in the paper? Any reason not to try?

(c) Write an outline protocol for a study to estimate the effects of switching to a ketogenic diet on frequency of fits amongst pre and infant schoolchildren with moderate epilepsy. (45% of marks)

Study should be randomised, controlled, blinded (if possible). What outcomes will be recorded, and when? (i.e., how and when is 'effect' measured?) How will target group be identified?

How will individuals be sampled?

Inclusion/exclusion criteria? How will diet be implemented?

What sample size is needed?

What size of effect would be important?

(17) Points for examiners:

Retrospective study with no control group. Children may have improved regardless of treatment.

Cyproheptadine only given if propranolol unsuccessful i.e., to more severe/persistent patients.

Not clear how long the patients were followed for.

No criteria given for deciding to discontinue treatment. This is important because the major aim is to 'evaluate the duration of treatment required for them to remain symptom free'.

24 patients in propranolol group. Average number of attacks pre and post is summarised as mean ± standard error. A standard error of 0.6 implies a standard deviation of approximately 3 (0.6 × sqrt(24)), and a standard error of 0.3 implies a standard deviation of approximately 1.5 (0.3 × sqrt(24)). This summary is probably not appropriate since the size of the standard deviations implies that the data are non-normally distributed. The distribution is probably positively skewed: a few children have very many attacks and the rest are around some lower average, i.e., the means given probably over-estimate the number of attacks that the majority of children suffer. Medians would be better summaries.

In the 12 patients treated with cyproheptadine the number of attacks is probably more severely skewed. Standard errors of 2.1 and 0.5 imply standard deviations of approximately 7 and 1.8 respectively.

'Few' side effects: 3/24 and 2/12, i.e., 1 in 6 or about 17% of the cyproheptadine treated group had side effects severe enough to be reported. These figures require confidence intervals.

Figure 1 could be improved.

Statement that abdominal migraine is 'not rare' is not substantiated. We would need to know the potential catchment population.

Numbers of attacks in the pre and post 6 month periods are more sensitive to changes than the excellent/fair/poor grading, which equates to no attacks post (regardless of the number pre), fewer attacks after, or the same number pre and post. However, the number of attacks may stay constant while each attack becomes less severe; this is recognised as 'fair' in the categorical ratings but is denoted as zero change when attacks are merely counted.

Randomised controlled trial of treatment versus placebo. Is this ethical? What about a trial of immediate versus delayed treatment? This would give an initial comparison period to see whether the problem resolves/improves spontaneously. How long should follow up be? Are there other factors to consider, e.g., age?
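Two of the calculations in the examiner points above can be sketched in code: recovering a standard deviation from a reported standard error (SD = SE × sqrt(n)), and attaching a confidence interval to the side effect proportions. The function names are illustrative, and the Wilson score interval is used here as one reasonable choice for small samples; the source does not specify a method.

```python
from math import sqrt
from statistics import NormalDist

def sd_from_se(se, n):
    """Standard deviation implied by a standard error of the mean and sample size n."""
    return se * sqrt(n)

def wilson_interval(events, n, confidence=0.95):
    """Wilson score confidence interval for a proportion (an assumed choice of method)."""
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    p = events / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half

# Standard errors quoted in the points above, with group sizes of 24 and 12
for se, n in [(0.6, 24), (0.3, 24), (2.1, 12), (0.5, 12)]:
    print(f"SE {se}, n {n}: implied SD {sd_from_se(se, n):.1f}")

# Side effects: 3 of 24 and 2 of 12
for events, n in [(3, 24), (2, 12)]:
    low, high = wilson_interval(events, n)
    print(f"{events}/{n}: 95% CI {100 * low:.0f}% to {100 * high:.0f}%")
```

With groups this small the intervals are wide, which is why the description of side effects as 'few' needs to be accompanied by confidence intervals.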