Advanced Research Computing


Flagship collaborative projects

Our 'flagship' projects arise from particularly long-lived or strategic collaborations, and often have wide-reaching impact.

Experimental Medicine Applications Platform, FlowEHR, PIXL, and SAFEHR

This suite of projects with UCLH builds on our long-standing collaboration, which began with the Health Informatics Collaborative. EMAP is a translational data science platform built in and for the NHS, created specifically to support research.

Today, the typical way for a researcher to access hospital data is to extract it from the hospital into the outside world. This introduces privacy risks, as the data leave the protected environment of the NHS. EMAP reverses this process: by providing a software environment within the hospital, we enable research to happen inside the NHS, so that patient data never have to leave.

EMAP has been developed as a non-operational “mirror” of a subset of UCLH data (historical and live). The underpinning aim is to ensure that no clinical data are corrupted or destroyed during the interaction between the research process and the hospital’s systems, and that the systems themselves are not compromised (for instance, by being interrupted or slowed down by research enquiries).

Various projects are now building on this core platform, or being built alongside it with a view to future integration. For instance, dashboards built on EMAP provided valuable insights to Intensive Care Unit staff during the peaks of the Covid pandemic. The real-time data flows provided by EMAP were crucial for Dr Zella King’s work helping UCLH staff to manage the flow of patients through the hospital, which won the Healthcare Partnership of the Year category at the London Higher Awards 2024.

FlowEHR is a safe, secure, cloud-native development and deployment platform for digital research and innovation within the trust. Again, the aim is ultimately to provide direct benefit to patient outcomes and the health system. FlowEHR is an evolution of the original vision for EMAP, encompassing more of the IT infrastructure layers needed to take proof-of-concept research projects and deploy them swiftly on live data for clinical validation.

PIXL (PIXL Image eXtraction Laboratory) is a system for extracting, de-identifying and exporting DICOM imaging data. It can optionally link this imaging data with de-identified electronic health record data and redacted free text reports in the OMOP data model. Finally, the system allows for the export of this linked data to a secure environment to enable medical imaging machine learning. The follow-on ambition is to progress from retrospective research and harness linked imaging data for innovation through direct improvement in patient care and increased efficiency in the health system.
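The de-identification and linkage idea can be sketched in miniature. This is purely illustrative and not the PIXL implementation: a real pipeline operates on DICOM files (for instance with pydicom) and handles far more tags and edge cases; here a header is modelled as a plain dictionary and the tag names are examples.

```python
# Hypothetical sketch of image de-identification with pseudonymous
# linkage. A real pipeline would parse DICOM headers; here a header
# is just a dict of tag name -> value.

IDENTIFYING_TAGS = {
    "PatientName",
    "PatientBirthDate",
    "InstitutionName",
    "ReferringPhysicianName",
}


def deidentify(header: dict, pseudonym: str) -> dict:
    """Strip direct identifiers and replace the patient ID with a
    stable pseudonym, so the image can later be linked to other
    de-identified records (e.g. EHR data in OMOP) via that pseudonym.
    """
    cleaned = {k: v for k, v in header.items() if k not in IDENTIFYING_TAGS}
    cleaned["PatientID"] = pseudonym
    return cleaned


# Example: clinical tags survive, direct identifiers do not.
header = {"PatientName": "DOE^JANE", "PatientID": "12345",
          "Modality": "CT", "StudyDate": "20240101"}
print(deidentify(header, "pseudo-0001"))
```

Using the same pseudonym across imaging and EHR exports is what makes the two de-identified datasets linkable downstream without exposing the original identity.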

SAFEHR is the team that has evolved around these initiatives. This team now not only encompasses members of ARC, but also includes people from several departments within the hospital, including Information Governance, with the ultimate aim of providing multi-modal data for research, innovation and education as a joint activity across organisational boundaries.

ExCALIBUR high-performance computing benchmarking suite

It is vital to measure the performance of scientific applications in a rigorous, systematic way throughout their development lifetime and across various hardware platforms. This is commonly referred to as benchmarking. Such benchmarking is a crucial activity on the UK’s path to "Exascale", the next generation of large-scale computers. It allows researchers and application developers to make informed decisions about the direction of their code's development, and about which hardware platforms to run on, so they can best take advantage of the new opportunities offered. It also helps us better understand the characteristics of current and future hardware, enabling the UK digital research infrastructure community to make educated choices about which future facilities to invest in.

Benchmarking has, however, long been something of a dark art, usually involving convoluted, machine-dependent scripts and configurations that only a few people can run and re-run, and whose results are difficult to reproduce. The situation is only getting worse in the quest for exascale, as the available hardware technologies become more heterogeneous. Since 2022 we have been working as a core part of ExCALIBUR, the UK's flagship next-generation computing initiative, to develop a common benchmarking framework that is open source, user-friendly, and automated, making systematic benchmarking tractable for the community. We have shared these tools and methods with the scientific community, and taught application owners the skills to improve their scientific software based on benchmarking as they target Exascale.

RSEs from UCL ARC worked with researchers from the Universities of Bristol and Reading, and the Met Office, to build this portable benchmarking framework based on Spack and ReFrame. The project benefited from in-kind contributions from the Universities of Cambridge, Durham and Leicester, partners in the DiRAC community.
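The core idea of systematic, reproducible benchmarking can be illustrated with a minimal stand-alone harness. This is a hedged sketch, not the Spack/ReFrame-based framework itself; the function name and the set of recorded fields are illustrative assumptions.

```python
# Illustrative benchmarking harness: time a workload several times
# and record summary statistics alongside machine metadata, so the
# result is comparable across re-runs and across systems.
import json
import platform
import statistics
import time


def benchmark(fn, *, repeats=5):
    """Run fn() `repeats` times and return a reproducible record:
    median and spread of the timings plus basic platform metadata."""
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        samples.append(time.perf_counter() - start)
    return {
        "machine": platform.machine(),
        "python": platform.python_version(),
        "repeats": repeats,
        "median_s": statistics.median(samples),
        "stdev_s": statistics.stdev(samples),
    }


if __name__ == "__main__":
    # A stand-in compute kernel; a real suite would build the target
    # application (e.g. via Spack) and launch it through a scheduler.
    result = benchmark(lambda: sum(i * i for i in range(100_000)))
    print(json.dumps(result, indent=2))
```

Recording the spread across repeats, not just a single timing, is what distinguishes a systematic benchmark from an ad-hoc one: it makes regressions distinguishable from noise.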

The project continues to be well received by the wider UK HPC community, confirming a real need to address reproducibility in benchmarking. We have demonstrated, through a still-growing suite of benchmark codes and configurations (to which the whole community can add), that our framework can be used to develop benchmarks that build and run across different systems in the UK HPC infrastructure. The framework includes a package and library for post-processing benchmarking results.

Energy Demand Observatory Laboratory (EDOL)

EDOL is a five-year project which will build, curate, and make available to scientists, industry and policymakers the first ever consistent, longitudinal, high-resolution, socio-technical dataset of disaggregated energy use in a representative sample of UK dwellings. This data observatory will provide unprecedented insight into the drivers and dynamics of residential energy use, with forensic subsamples allowing specific policy and technology questions to be answered in greater detail. The same data infrastructure will also provide a laboratory and set of tools for researchers and industry to rapidly and efficiently test the impact of new technologies and policy interventions under real world conditions.
ARC is hosting the observatory, building on the system developed for the earlier Smart Energy Research Lab (SERL). The lab provides a technical and data governance infrastructure that allows accredited UK researchers to access half-hourly gas and electricity smart meter readings for ~13,000 consenting GB households. These data are combined with Energy Performance Certificate (EPC), weather, demographic and building information.
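As a hedged illustration of the kind of derived series such an observatory supports (not SERL's actual schema or code), half-hourly meter readings might be aggregated into daily totals per household like this:

```python
# Illustrative aggregation of half-hourly kWh readings into daily
# totals per household. The (household, timestamp, kwh) tuple format
# is an assumption for the sketch, not the real SERL data model.
from collections import defaultdict
from datetime import datetime


def daily_totals(readings):
    """Sum half-hourly kWh readings into one total per household per
    day -- the kind of derived series researchers would then combine
    with EPC, weather and demographic data."""
    totals = defaultdict(float)
    for household, timestamp, kwh in readings:
        day = datetime.fromisoformat(timestamp).date()
        totals[(household, day)] += kwh
    return dict(totals)


readings = [
    ("h1", "2024-01-01T00:00:00", 0.50),
    ("h1", "2024-01-01T00:30:00", 0.25),
    ("h2", "2024-01-02T12:00:00", 1.00),
]
print(daily_totals(readings))
```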

Open Richly Annotated Cuneiform Corpus (ORACC)

This was one of our first ever projects, and continues to this day, developing tools for scholars around the world working on translations and transliterations of ancient Mesopotamian (Iraqi) texts. Among other things, we are creating a new user-friendly environment for working with digital records of cuneiform texts, which will automate many of the tasks involved. We are also expanding the capabilities of the ORACC website, making it easier to search through the vast collection of information it contains, for both researchers and interested members of the public.

Screenshot of the Nisaba editor for ORACC, showing syntax highlighting and a validation error report
k-Plan: clinical support software for ultrasound treatment of neurological diseases

We're working with UCL's Biomedical Ultrasound Group, the University of Brno's Supercomputing Research Group, and BrainBox Ltd to develop and maintain a system for running simulations of ultrasound treatments.

Low-intensity ultrasound directed into the head ("transcranial ultrasound stimulation") can stimulate brain activity, and this may have therapeutic potential for neurological diseases such as treatment-resistant depression, epilepsy, and essential tremor. k-Plan is the first "HPC-as-a-service" software that enables clinical staff to run these computationally intensive simulations.

Screenshot of k-Plan ultrasound therapy planning software
Thanzi la Onse

Thanzi la Onse project logo
This project's name means "Health of All"; it is a simulation model of healthcare need and service delivery for Malawi. It draws on demographic, epidemiological and healthcare system data to inform decision-makers in Malawi about disease dynamics, healthcare budgets and the allocation of resources in the nation. The team aims to explore a variety of interventions to improve the health of the population in Malawi, as well as to reduce health inequality in the ECSA region. ARC designed and implemented the simulation framework that integrates the models of over a dozen epidemiologists and health economists. We continue to improve and develop features of the model, whilst supporting modellers with onboarding and training, model code, performance optimisation, and the cloud infrastructure and services for running, profiling and reporting from the model.

EPPI-Reviewer

EPPI-Reviewer is a web application for performing systematic literature reviews, developed by the Evidence for Policy and Practice Information and Co-ordinating Centre (EPPI Centre), part of the Social Science Research Unit at UCL's Institute of Education. Its aim is to enable the extraction and synthesis of reliable research findings to inform scientific decisions and policy making.

ARC is working on researching, developing and integrating modern machine learning technologies into the EPPI-Reviewer application. One task is to automate the identification of studies for reviews by performing semantic search over a vectorised database of research papers. Specifically, the data science team in ARC is ingesting millions of documents from the OpenAlex dataset using embedding models such as SPECTER2. The document embeddings are indexed in a Postgres database. Work is also ongoing to explore optimal ways to cluster similar research papers via their embeddings.

Another task is to enable natural language question answering over research studies using Large Language Models (LLMs). The team has explored and evaluated different ways of retrieving document chunks most relevant to the user queries, and also assessed the performance of different LLMs on question-answering over the retrieved chunks.
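The retrieval step can be sketched with plain cosine similarity over embedding vectors. This is an illustrative toy, not the team's implementation (which indexes embeddings in Postgres at scale); the chunk structure and function names are assumptions.

```python
# Illustrative retrieval for question answering: rank document chunks
# by cosine similarity between their embeddings and the query's
# embedding, then keep the top k as context for an LLM.
import math


def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)


def top_k(query_embedding, chunks, k=3):
    """Return the k chunks whose embeddings are closest to the query
    embedding; in a real system these would be passed to the LLM."""
    return sorted(
        chunks,
        key=lambda c: cosine(query_embedding, c["embedding"]),
        reverse=True,
    )[:k]
```

In production this linear scan is replaced by an indexed nearest-neighbour search inside the database, but the ranking criterion is the same.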

Rivet: Long-term preservation for the Large Hadron Collider and beyond

Differential cross-section for leptonic Z-boson production as a function of the number of additional emissions in the hard collision. Predictions from two alternative models (in red and blue) are compared to the experimental data points (in black).
Rivet ("Robust Independent Validation of Experiment and Theory") is the leading framework used by particle physicists to preserve the analysis logic of measurements from the Large Hadron Collider as well as many other past and present particle colliders around the world. The open-source tool kit receives hundreds of new analysis routines every year, giving rise to an ever-growing wealth of collider results that can be easily reproduced and reinterpreted in the light of new theory models. Over the past 15 years, Rivet has become the common language that bridges the gap between experiment and theory in the collider physics community. [arXiv:1003.0694, arXiv:1912.05451, arXiv:2404.15984]
YODA: Multi-dimensional differential histogramming in statistical data analysis

In the contemporary landscape of advanced statistical analysis toolkits, ranging from Bayesian inference to machine learning, the seemingly straightforward concept of a histogram often goes unnoticed. Yet its significance is profound: a histogram is a powerful tool that encapsulates summary statistics within binned ranges of independent variables. These bins maintain a fixed data size, irrespective of the number of aggregated events, and have a mathematical definition intricately connected to fundamental concepts in differential and integral calculus. Exploiting these parallels, the YODA ("Yet more Objects for Data Analysis") statistical analysis library stands out, having been meticulously designed to harmonise mathematical consistency with a user-friendly programmatic interface. [arXiv:2312.15070]
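The notion of a bin as a fixed-size container of summary statistics can be sketched as follows. This is a toy illustration of the concept, not YODA's actual API; the class and method names are assumptions made for the sketch.

```python
# Illustrative 1D histogram: each bin stores the event count, the sum
# of weights and the sum of squared weights. These three numbers have
# fixed size no matter how many events are filled, yet preserve the
# bin height and its statistical uncertainty.
import bisect


class Histo1D:
    def __init__(self, edges):
        self.edges = list(edges)
        n = len(self.edges) - 1
        self.sumw = [0.0] * n    # sum of weights per bin
        self.sumw2 = [0.0] * n   # sum of squared weights (for errors)
        self.counts = [0] * n    # raw event counts

    def fill(self, x, w=1.0):
        """Accumulate a weighted event into the bin containing x."""
        if not (self.edges[0] <= x < self.edges[-1]):
            return  # a full implementation would keep overflow bins
        i = bisect.bisect_right(self.edges, x) - 1
        self.sumw[i] += w
        self.sumw2[i] += w * w
        self.counts[i] += 1

    def height(self, i):
        """Bin height as a density: sum of weights / bin width."""
        return self.sumw[i] / (self.edges[i + 1] - self.edges[i])
```

Dividing by the bin width in `height` is where the link to calculus appears: as bins narrow, the histogram approaches the differential distribution, and summing `sumw` over bins recovers the integral.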

Differential cross-section for leptonic Z-boson production as a function of the number of additional emissions in the hard collision. The Standard Model prediction (in purple) is compared to the experimental data points (in black).
An example of a 2D heat map, illustrating the correlations between two arbitrary observables X and Y in a 2D plane. A colour gradient indicates the likelihood, ranging from likely (in yellow) to unlikely (in blue).