Machine Learning for High-Throughput Phenotyping and Comorbidity Mapping in EHR Data

Project Summary

Electronic health records (EHRs) are generated when patients interact with the healthcare system and contain information on diagnoses, symptoms, procedures, prescriptions, and tests. Identifying patients with a specific disease, along with its onset and progression (a process called phenotyping), is challenging and time-consuming because EHR data are high-dimensional and noisy. A patient typically has tens of thousands of data points spanning multiple years, and there is significant variation in how diseases manifest and progress. Delivering personalised treatment relies on accurately identifying diseases and their subtypes, as well as clusters of comorbidities. Machine learning has the potential to identify disease phenotypes and subtypes by finding non-obvious patterns in complex cross-sectional and longitudinal EHR data.

The candidate will work with world-leading machine learning and translational bioinformatics experts at BenevolentAI to develop and evaluate machine learning approaches (e.g. sparse latent factor models, nonnegative tensor factorization, semi-supervised anchor learning) for identifying neurological diseases, disease subtypes, and clusters in multi-modal EHR data. The project will use large-scale contemporary data sources (UK Biobank: 500,000 middle-aged adults with extensive genotyping and phenotyping; CALIBER: 15M primary care patients with hospital EHR) and potentially industry-sponsored and proprietary data resources. The project will also consider approaches that use external information sources, such as published medical literature, to create more concise and interpretable disease phenotypes.