Molecule learning

Motivation

The development of a new drug is usually time-consuming (often taking more than 10 years) and expensive (costing more than $1 billion). Applying machine learning (ML) to drug design has great potential to accelerate the process, which is vital to the success of the UK pharmaceutical industry. In particular, ML methods achieve superior performance on binding affinity prediction tasks. However, this has not resulted in breakthroughs in developing novel ligands for protein targets of interest, owing to the limitations of benchmark data and the limited generalization ability of ML models [1]. It is therefore essential to further improve the predictive performance and robustness of ML algorithms for binding affinity prediction.

Project Objectives

In collaboration with the team of Professor Ross King (Cambridge University and Chalmers Institute of Technology), we aim to explore efficient representations of 3D data and learning algorithms for binding affinity prediction. In particular, we plan to investigate applications of the path signature to represent spatial structure in a principled and efficient way. We are also interested in enhancing state-of-the-art machine learning methods for binding activity prediction by incorporating effective spatial feature representations and relational learning. This project is funded by the Alan Turing Institute (Turing Project Link).

Methodologies and Research Findings

The signature of a path originates from rough path theory, which generalizes classical control theory to make sense of differential equations driven by rough signals, including signals rougher than semimartingales. Recently, the signature of a path has been used as a principled and effective feature for sequential data, and it has been shown empirically that the signature feature can often bring a performance boost when combined with state-of-the-art machine learning methods. However, it remains unclear how to use the signature feature to represent spatial structure.
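To make the signature feature concrete, the following is a minimal sketch, not the project's implementation, of the first two signature levels of a piecewise-linear path. The function name `signature_depth2` and the truncation at depth 2 are assumptions made for illustration; the update for each straight segment follows Chen's identity.

```python
from typing import List, Sequence, Tuple

Vector = List[float]
Matrix = List[List[float]]


def signature_depth2(path: Sequence[Sequence[float]]) -> Tuple[Vector, Matrix]:
    """Levels 1 and 2 of the signature of a piecewise-linear path.

    `path` is a sequence of points in R^d (hypothetical input format).
    Level 1 is the total increment; level 2 holds the iterated integrals
    of coordinate i against coordinate j.
    """
    d = len(path[0])
    level1 = [0.0] * d
    level2 = [[0.0] * d for _ in range(d)]
    for start, end in zip(path, path[1:]):
        dx = [e - s for s, e in zip(start, end)]
        # Chen's identity for appending one straight segment:
        # new level 2 = old level 2 + (old level 1) x dx + 0.5 * dx x dx
        for i in range(d):
            for j in range(d):
                level2[i][j] += level1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            level1[i] += dx[i]
    return level1, level2


# An L-shaped path in the plane: right by 1, then up by 1.
level1, level2 = signature_depth2([(0.0, 0.0), (1.0, 0.0), (1.0, 1.0)])
```

For this L-shaped path, level 1 is the total displacement, and the antisymmetric part of level 2 is the signed (Lévy) area enclosed between the path and its chord, which is how the signature captures the order in which the coordinates move.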

In this project, we aim to extend the applications of the signature feature to 3D spatial molecule data. Mathematically, one can regard the molecule data as an unlabelled graph composed of nodes (atoms) and edges (bonds). For each atom C, one can consider its neighbouring region of a given radius R and enumerate all possible pathlets of length R starting at C. The average of the signatures of these pathlets can then be regarded as the expected signature of a molecule path of length R starting at atom C, and it can be proved that the expected signature fully characterizes the 3D structure of the molecule over such a local region. Numerical results show that combining gradient boosted tree methods with the expected signature achieves decent predictive performance on binding activity score prediction.
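The pathlet construction above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the project's code: "length R" is taken to mean the number of bonds traversed, pathlets are taken to be simple paths (no atom is revisited), the signature is truncated at depth 2, and the names `pathlets`, `signature_depth2_flat`, and `expected_signature`, as well as the toy molecule, are hypothetical.

```python
from typing import Dict, List, Sequence, Tuple

Coords = Dict[int, Tuple[float, float, float]]  # atom index -> 3D position
Bonds = Dict[int, List[int]]                    # atom index -> bonded atoms


def pathlets(bonds: Bonds, start: int, length: int) -> List[List[int]]:
    """All simple paths with exactly `length` bonds starting at `start`."""
    paths = [[start]]
    for _ in range(length):
        paths = [p + [n] for p in paths for n in bonds[p[-1]] if n not in p]
    return paths


def signature_depth2_flat(points: Sequence[Sequence[float]]) -> List[float]:
    """Depth-2 signature of a piecewise-linear path, flattened to one vector."""
    d = len(points[0])
    s1 = [0.0] * d
    s2 = [[0.0] * d for _ in range(d)]
    for a, b in zip(points, points[1:]):
        dx = [q - p for p, q in zip(a, b)]
        for i in range(d):
            for j in range(d):
                s2[i][j] += s1[i] * dx[j] + 0.5 * dx[i] * dx[j]
        for i in range(d):
            s1[i] += dx[i]
    return s1 + [v for row in s2 for v in row]


def expected_signature(coords: Coords, bonds: Bonds, start: int, length: int) -> List[float]:
    """Average the signatures of all pathlets of `length` bonds from `start`."""
    sigs = [
        signature_depth2_flat([coords[a] for a in p])
        for p in pathlets(bonds, start, length)
    ]
    if not sigs:
        raise ValueError("no pathlet of the requested length starts at this atom")
    return [sum(col) / len(sigs) for col in zip(*sigs)]


# Toy 4-atom molecule (made-up coordinates, for illustration only).
coords: Coords = {0: (0.0, 0.0, 0.0), 1: (1.0, 0.0, 0.0),
                  2: (0.0, 1.0, 0.0), 3: (1.0, 0.0, 1.0)}
bonds: Bonds = {0: [1, 2], 1: [0, 3], 2: [0], 3: [1]}
feature = expected_signature(coords, bonds, start=0, length=1)
```

Each atom then yields a fixed-length feature vector (here 3 + 9 = 12 numbers), regardless of how many pathlets start at it, which is what makes the representation convenient as input to a tree-based model.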

Researchers

Professor Ross King, Cambridge University and Chalmers Institute of Technology
Dr Abbi Abdel Rehim, Department of Chemical Engineering and Biotechnology, University of Cambridge
Dr Oghenejokpeme I. Orhobor, Department of Chemical Engineering and Biotechnology, University of Cambridge
Dr Hao Ni, Department of Mathematics, UCL
Hang Lou, Department of Mathematics, UCL

Reference

[1] Luo, H., Ye, H., Ng, H.W., Shi, L., Tong, W., Mendrick, D.L. and Hong, H., 2015. Machine learning methods for predicting HLA-peptide binding activity. Bioinformatics and Biology Insights, 9, pp.BBI-S29466.