
Advanced Research Computing


Internship Impact Summary: Health Data Science Black Internship with ARC & HDR UK

Ditso Tirelo's project used machine learning to study TP53 mutations, finding key hotspots (e.g., 282, 245) and links to cancer. This research provides insights and a foundation for future studies.

PCA plot from Ditso Tirelo's research showing overlaid data points, separated dataset plots, and centroids for benign and deleterious screens, highlighting TP53 mutation patterns linked to cancer.

27 November 2024

Background

Over the summer, ARC hosted intern Ditso Tirelo in conjunction with Health Data Research UK (HDR UK) to complete a research project with Dr Benjamin Hall of UCL’s Department of Medical Physics and Biomedical Engineering. Dr Hall’s research uses computational biology to investigate how mutations in cells can affect the body, with the aim of improving the medical understanding of cancer. Ditso’s project built upon previous research conducted by the Hall Lab, using machine learning to explore the TP53 gene. Because TP53 is the most commonly mutated gene in human cancers, understanding how these mutations arise is especially important.

Research Objectives and Approach

The primary objectives of the project were to explore these mutations and their causes, and to compare cancer cases with healthy cases. To start, Ditso inherited a Python code repository from Dr Hall’s existing research. Its three Jupyter notebooks included extensive data from samples and real patients, along with exploratory tests that generated data visualisations, calculations, and hypothesis tests. Three mutation effects were studied: misfolding, protein association, and DNA binding affinity. Mutation hotspots were also identified as specific numbered sites along the protein.

To extend the scope of this work, Ditso introduced further factors describing other protein features and tested their validity for the model using ROC analysis. The factors from the given data and the new variables were then pre-processed using Principal Component Analysis (PCA), a dimensionality reduction technique widely used in machine learning. The results highlighted details of the mutation hotspots: mutations at position 282 may be cancer inhibitors, as they were present in healthy samples but not in cancerous samples, while the cancerous samples had more mutations at positions 245 and 176.
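As an illustration of how a candidate factor can be screened with ROC analysis, the sketch below scores each feature on its own and reports the area under the ROC curve (AUC). It is not taken from Ditso’s repository: the data are synthetic, and the feature names `folding_energy_change` and `unrelated_noise` are hypothetical placeholders.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

# Synthetic stand-in data (assumption): 200 mutations,
# label 1 = deleterious, label 0 = benign.
n = 200
labels = rng.integers(0, 2, size=n)

# Hypothetical candidate features: one shifted by class, one pure noise.
candidate_features = {
    "folding_energy_change": 1.5 * labels + rng.normal(size=n),
    "unrelated_noise": rng.normal(size=n),
}


def screen_features(features, labels):
    """Return the ROC AUC of each feature used alone as a classifier score."""
    return {
        name: float(roc_auc_score(labels, values))
        for name, values in features.items()
    }


auc_by_feature = screen_features(candidate_features, labels)

# An AUC near 0.5 means the feature alone does not separate the classes;
# values closer to 1.0 indicate a more informative feature.
for name, auc in sorted(auc_by_feature.items(), key=lambda kv: -kv[1]):
    print(f"{name}: AUC = {auc:.2f}")
```

Screening features one at a time like this is only a first pass; a feature that looks weak alone may still be useful in combination with others.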

Key Findings 

These hotspots cluster in the region of TP53 associated with DNA binding, so mutations there may interrupt the binding process. From the given data, it is also known that misfolding mutations accounted for around 70% of all mutations, which suggests that the change is likely occurring at the DNA-binding site. These implications were then used to set up avenues to explore in the machine learning stage, specifically whether the mutations could be predicted and whether the observed patterns would apply to a larger number of mutations. Finally, PCA was used to separate the mutations into benign and deleterious groups.
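The PCA step, including the group centroids shown in the plot above, can be sketched as follows. This is a minimal example under stated assumptions rather than the project’s actual code: the two groups are synthetic Gaussian clusters, and the number of features (five) is arbitrary.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(1)

# Synthetic stand-in data (assumption): two groups of mutations described
# by five numeric features, with the deleterious group shifted by one unit.
n_per_class, n_features = 100, 5
benign = rng.normal(loc=0.0, size=(n_per_class, n_features))
deleterious = rng.normal(loc=1.0, size=(n_per_class, n_features))
X = np.vstack([benign, deleterious])
labels = np.array([0] * n_per_class + [1] * n_per_class)

# Standardise each feature, then project onto the first two components.
X_scaled = StandardScaler().fit_transform(X)
pca = PCA(n_components=2)
projected = pca.fit_transform(X_scaled)

# Centroid of each group in the two-dimensional PCA space.
centroids = {
    name: projected[labels == code].mean(axis=0)
    for code, name in enumerate(["benign", "deleterious"])
}
separation = float(np.linalg.norm(centroids["benign"] - centroids["deleterious"]))

print("explained variance ratio:", pca.explained_variance_ratio_)
print("centroid separation:", round(separation, 2))
```

PCA itself is unsupervised, so the labels are used here only after the projection, to colour the points and compute the centroids.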

Ditso then developed a machine learning model that used a random forest classifier to identify the most relevant and highly contributing factors. The model then used logistic regression to make predictions in three different cases:

  1. Whether an amino acid position was likely to mutate by misfolding (most common and highly accurate).
  2. Whether a sample would be cancerous or healthy (more challenging and less accurate).
  3. Whether a sample would be benign or deleterious (moderately accurate).
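The two-stage approach described above, ranking factors with a random forest and then predicting with logistic regression, might look roughly like the sketch below. It is an assumption-laden illustration and not the model from Ditso’s repository: the data are synthetic, the feature names are hypothetical, and keeping the top two features is an arbitrary choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(2)

# Synthetic stand-in data (assumption): hypothetical feature names, with
# only the first two columns actually influencing the label.
feature_names = ["folding_energy_change", "dna_binding_change", "noise_a", "noise_b"]
n = 400
X = rng.normal(size=(n, len(feature_names)))
y = (X[:, 0] + 0.8 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

# Stage 1: rank features by random forest importance.
forest = RandomForestClassifier(n_estimators=200, random_state=0)
forest.fit(X_train, y_train)
ranking = np.argsort(forest.feature_importances_)[::-1]
top_k = 2  # arbitrary cut-off for this sketch
selected = ranking[:top_k]

# Stage 2: logistic regression on the selected features only.
model = make_pipeline(StandardScaler(), LogisticRegression())
model.fit(X_train[:, selected], y_train)
accuracy = model.score(X_test[:, selected], y_test)

print("selected:", [feature_names[i] for i in selected])
print("held-out accuracy:", round(accuracy, 2))
```

In practice each of the three cases would have its own labels and feature set, and accuracy would be estimated with cross-validation rather than a single split.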

Impact and Future Potential 

This work establishes a solid baseline for future research with a wider scope and greater computational capability. In particular, Ditso found a strong relationship between folding energies and the prevalence of mutations, which opens the possibility of exploring this correlation in further studies. Additionally, since the machine learning model uses logistic regression, more advanced techniques may be applied to improve accuracy in distinguishing cancer samples from healthy samples, which Ditso’s work found to be a complex and multifaceted task requiring additional data. The computational resources and discoveries that Ditso provided to the Hall Lab team serve their objective of understanding the factors affecting cancer, and reflect the technical support provided by ARC staff.

Additional Resources

You can access Ditso’s publicly available source code for all the methods implemented.