The mission of the Institute of Health Informatics is to deliver high-quality, cutting-edge research linking electronic health data with other forms of research and routinely collected data. The Institute has a distinctive vocation in using various kinds of data, often very different from the data found in typical epidemiological studies or clinical trials. These new kinds of data bring new challenges for data collection, linkage, curation and analysis.
For example, we can link data from GP electronic health records, hospital episode statistics and socio-economic information, and add information derived from genomics, proteomics and metabolomics data. Each of these data sources has its own design issues, such as repeated measures of exposures and outcomes (not independent of each other), data missing not at random, competing risks, measurement error and measurement bias. Each of these issues can introduce its own bias, and more so in combination. Neither traditional statistical models nor novel machine learning (ML) algorithms account by default for all the possible biases and limitations that arise from the complex structure of the data. We need to adapt the standard analytical methods in a process that we can think of as 'translational methodological research'. This adaptation, apart from serving IHI’s research needs, could then be applied in other settings with similar data structures.
The translational role of IHI in methodological research
Just as doctors speak of 'translational medicine' to describe the specific challenges of carrying knowledge from 'bench to bedside', there is an analogous translational problem in analytic methods research. Analytic methods are often developed under ideal theoretical assumptions that are very unrealistic. When it comes to applying the methods to real data, many problems and unmet assumptions puzzle the people who have to apply them. There is an important body of research work on how to adapt the 'ideal' analytical methods to 'real world' data. This applied research can be very context-specific, depending on the data and the environment.
A big part of this translational methodological research is to translate between different research languages: between the language of epidemiologists and applied statisticians, or the language of the machine learning community and that of clinicians or geneticists. The Institute stands in the middle of these communities, which often have serious difficulty understanding each other, and an important part of our job is to facilitate the conversation between them. The figure below gives an image of the research process and where IHI comes in.
Some of the methodological problems that we face in health informatics include the following.
Problems with data management
- Multiple data sources: data linkage
- Potentially large number of variables: clinical, genomics, proteomics, metabolomics, imaging, wearable devices (what to keep, what to discard, what to summarise?)
- Continuous data flow: how to incorporate new data and discard old data. How to keep our models updated with the data that is flowing in continuously
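As a small illustration of the data linkage item in the list above, the following Python sketch links two sources on a shared identifier and keeps track of records that fail to link. It is only a minimal, deterministic example: the field names (`patient_id`, `smoker`, `diagnosis`), the toy records and the function `link_records` are all hypothetical, and real linkage of health records usually also needs probabilistic matching and governance controls that are not shown here.

```python
from collections import defaultdict


def link_records(gp_records, hospital_episodes, key="patient_id"):
    """Deterministically link GP records to hospital episodes on a shared key.

    Returns (linked, unmatched): GP records with their episodes attached,
    and GP records for which no hospital episode was found.
    """
    # Index hospital episodes by identifier so each lookup is cheap.
    episodes_by_patient = defaultdict(list)
    for episode in hospital_episodes:
        episodes_by_patient[episode[key]].append(episode)

    linked, unmatched = [], []
    for record in gp_records:
        episodes = episodes_by_patient.get(record[key], [])
        if episodes:
            linked.append({**record, "episodes": episodes})
        else:
            unmatched.append(record)
    return linked, unmatched


# Hypothetical toy data, for illustration only.
gp_records = [
    {"patient_id": "A1", "smoker": True},
    {"patient_id": "B2", "smoker": False},
]
hospital_episodes = [
    {"patient_id": "A1", "diagnosis": "I21"},
    {"patient_id": "A1", "diagnosis": "I50"},
]

linked, unmatched = link_records(gp_records, hospital_episodes)
```

Returning the unmatched records explicitly is a deliberate choice: who fails to link is rarely random, so the unlinked group is itself worth inspecting.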
Problems with the accuracy of the data
- Measurement error: random or systematic (what to do when different methods are used to measure the same thing)
- Misclassification of exposures and diagnoses (or different criteria in different places)
- Text mining: How to extract information from written reports
- In the area of wearable devices: matching iterative development of technology to evidence-based medicine
- Selection bias: patients contribute data more intensively to electronic health records depending on their health status. This makes it difficult to find healthy control patients
- Measurement bias of exposures: exposures are often measured because of a previous health problem. This muddles the causal chain between exposure and outcome
- Multilevel clustering: data are naturally clustered in different non-independent dimensions: geographical and administrative (GP practice/hospital/region), ethnic groups, socio-economic status...
- Many repeated measures of exposures and outcomes, often not independent from each other. This makes the ascertainment of causality very difficult
- Competing risks when many outcomes are recorded
- Potential for false positives: many possible analyses of exposures and outcomes, together with large sample sizes, will increase the chances of statistically significant but false signals
- Pre-processing of data prior to statistical analysis can be challenging (imaging data, repeated measures). It is not always clear which variable should be the outcome
- Combining information from different sources and with prior knowledge promises to be one of the most challenging issues. Bayesian methods could be a possible solution
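The false-positive concern in the list above can be made concrete with a short calculation. The Python sketch below computes the chance of at least one false signal when many tests are run and no true effect exists, and shows the effect of a Bonferroni-corrected threshold. It assumes the tests are independent, which is rarely true for exposures and outcomes taken from the same records, so the numbers are indicative only; the function names are illustrative, not taken from any library.

```python
def family_wise_error_rate(n_tests, alpha=0.05):
    """Probability of at least one false positive among n_tests
    independent tests when every null hypothesis is true."""
    return 1.0 - (1.0 - alpha) ** n_tests


def bonferroni_threshold(n_tests, alpha=0.05):
    """Per-test significance threshold that keeps the family-wise
    error rate at or below alpha."""
    return alpha / n_tests


n_tests = 100  # hypothetical number of exposure-outcome analyses

# Chance of at least one spurious signal without any correction.
uncorrected = family_wise_error_rate(n_tests)

# The same chance when each test uses the Bonferroni threshold.
corrected = family_wise_error_rate(n_tests, bonferroni_threshold(n_tests))

print(f"Uncorrected: {uncorrected:.3f}")
print(f"Bonferroni-corrected: {corrected:.3f}")
```

Bonferroni is used here only because it is the simplest correction to state. It is conservative, and other approaches, such as controlling the false discovery rate, may suit exploratory analyses better.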
Implementation of technology
- Statistical solutions for data governance and privacy. How can we use individual person data for research and guarantee privacy and anonymity?
- Incorporating RCTs within normal clinical care (point-of-care RCTs)
- Issues around acceptance and use of new technologies
Figure: Case scenario of research in Farr (in bold where methodological research comes in):
- Someone asks a clinical/epidemiological question
- At IHI we think how to answer the question with available data and analytical methods
- We prepare data for this particular analysis
- We adapt methods for this particular problem
- We run the analysis
- We interpret results and disseminate
Connections with other groups and institutions
For methodological research at IHI it is crucial to collaborate with other institutions that are more focused on theoretical research in their specific areas of expertise. Thanks to our privileged location in central London we are surrounded by institutions such as the Alan Turing Institute for machine learning and big data methods, the Francis Crick Institute for discovery of biomedical mechanisms, the London School of Hygiene & Tropical Medicine with its wide expertise in epidemiology, as well as all the allied institutes and centres at UCL. We can draw on their expertise to choose among methods for answering our questions, and we can work with them to tackle the problems that our data present us with.