Summer Intern Project 2013: Hannah James

Multivariate Statistical Methods from Ecology for Handling Immunological Data

Supervisors: Robin Callard and Joanna Lewis

At the end of 2011 an estimated 3.3 million children were living with Human Immunodeficiency Virus (HIV), of whom an increasing number were receiving antiretroviral therapy (ART) (1). There have been very few studies into the long-term reconstitution of the immune system in children on ART.  Reconstitution of a child's immune system differs from an adult's (2), due to different production and death rates of CD4⁺ T-lymphocytes in children (3). Differences between children’s individual patterns of reconstitution have also been observed (4), but the reasons for these differences are not yet fully understood.

I used data collected during the ARROW trial, a 5-year clinical trial of treatments for HIV-positive children, involving over 1200 ART-naïve children aged 6 months to 17 years who started ART (5). For this project, I focused on a subset of 199 subjects, whose blood was collected for analysis 0, 4, 12 and 24 weeks after starting ART, and every 12 weeks subsequently for an average of 4 years. Two independent fluorescence-activated cell sorting (FACS) analyses were carried out on CD4⁺ T-cells collected from this subset of subjects: the first examined expression of the cell-surface markers CD45RA (naïve cells), CD31 (recent thymic emigrants (RTE)) and Ki67 (dividing cells), and the second examined CD45RA, CD31 and HLA-DR (activated T-cells). FACS values were presented as relative percentages of CD4⁺ T-cells. I used the FACS data from week 0, focusing on the immunological profiles of the children before they initiated ART.

I implemented methods from the R package “vegan”, written for multivariate analysis of ecological data (6), to carry out cluster analysis on the FACS data. I assessed the clusters formed by the different methods and combinations of variables using silhouette plots, correlation statistics, bar plots and bubble plots. This included calculating and testing different distance measures on the data; I found that Euclidean distance matrices produced the best results. Using Euclidean distance, Ward’s method of hierarchical cluster analysis was best for selecting the optimum number of clusters. The final partition could then be produced using non-hierarchical k-means clustering, which assigns each observation to the cluster with the nearest mean.
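
The k-means step and the silhouette check described above can be sketched in a few lines. The original analysis was carried out in R, so the Python below is only an illustrative sketch and not the code used in the project: the function names (euclidean, kmeans, mean_silhouette) and the six toy observations are invented for the example, and Ward's hierarchical method is not reproduced here.

```python
import math
import random


def euclidean(a, b):
    """Euclidean distance between two equal-length numeric tuples."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(points, k, n_iter=100, seed=0):
    """Plain k-means (Lloyd's algorithm): returns (labels, centres)."""
    rng = random.Random(seed)
    centres = rng.sample(points, k)  # start from k observations chosen at random
    labels = None
    for _ in range(n_iter):
        # Assignment step: each observation joins the cluster with the nearest mean.
        new_labels = [min(range(k), key=lambda j: euclidean(p, centres[j]))
                      for p in points]
        if new_labels == labels:  # no change, so the partition has converged
            break
        labels = new_labels
        # Update step: move each centre to the mean of its current members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centre if a cluster becomes empty
                centres[j] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels, centres


def mean_silhouette(points, labels):
    """Average silhouette width; needs at least two clusters."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    widths = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            widths.append(0.0)  # usual convention for singleton clusters
            continue
        # a: mean distance to the other members of the same cluster
        a = sum(euclidean(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(sum(euclidean(p, q) for q in other) / len(other)
                for other_lab, other in clusters.items() if other_lab != lab)
        widths.append((b - a) / max(a, b))
    return sum(widths) / len(widths)


# Invented percentages (CD45RA+, CD31+) for six hypothetical children;
# these are NOT values from the ARROW data.
points = [(70.0, 60.0), (65.0, 55.0), (72.0, 58.0),
          (30.0, 20.0), (35.0, 25.0), (28.0, 22.0)]
labels, centres = kmeans(points, k=2)
print(labels, round(mean_silhouette(points, labels), 2))
```

A mean silhouette width close to 1 suggests compact, well-separated clusters, which is the kind of evidence a silhouette plot summarises visually. Repeating the call for several values of k and comparing the widths is one common way to judge a partition, although the number of clusters in this project was chosen with Ward's method.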

I analysed the output clusters using a number of methods. First, I compared the clusters against other variables from the original data by producing boxplots. Age seems to be the primary factor driving changes in the relative proportions of T-cell subsets. This agrees with observations in healthy children (7), where the proportion of naïve T-cells decreases with age and the proportion of memory T-cells increases with age.

To conclude, age was the principal factor driving the clustering. The clusters containing the older children also contained the children with the lowest CD4 Z scores, a measure of CD4 count standardised for the age-expected CD4% in healthy children.
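
As a rough illustration of what a Z score expresses, the sketch below standardises an observed value against an age-specific reference mean and standard deviation. This is the generic textbook definition only: the Z scores in this project may have been derived differently, and the function name and the reference values in the example are invented.

```python
def z_score(observed, expected_mean, expected_sd):
    """Number of reference standard deviations between observed and expected."""
    if expected_sd <= 0:
        raise ValueError("expected_sd must be positive")
    return (observed - expected_mean) / expected_sd


# Hypothetical numbers: an observed value of 15 against an age-expected
# mean of 35 and standard deviation of 10 (not real reference data).
print(z_score(15.0, 35.0, 10.0))  # -2.0
```

A negative Z score means the child's value lies below what would be expected for a healthy child of the same age.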

This project was my first experience of computing and analysing data, and I feel like I have learnt a lot. The mathematical approach to biology has emphasised the importance of taking a fully interdisciplinary approach when considering a problem.

1. UNAIDS (2012) UNAIDS report on the global AIDS epidemic 2012

2. J Lewis et al. (2012) J Infect Dis 205:548-556

3. I Bains et al. (2009) Blood 113:5480-5487

4. M-Q Picat et al. (2013) Under review

5. ARROW Trial Team (2013) Lancet Early online publication

6. D Borcard et al. (2011) Numerical Ecology with R, Springer

7. R van Gent et al. (2009) Clinical Immunology 133:95-107

8. S Resino (2006) J Acquir Immune Defic Syndr 42:269-276

Page last modified on 15 aug 13 20:48