Repackaging legacy software for protein structural domain annotation: an OSSS project
ARC worked with the Orengo group (UCL Structural & Molecular Biology) on a project to make an existing bioinformatics data pipeline more portable, more scalable and ready to be applied to huge new datasets.
30 August 2024
Figure 1 (above): Outline of the CATH classification pipeline.
Background
One of the fundamental building blocks in the field of Computational Biology is the identification of evolutionary relationships between proteins. However, proteins can be large and complex; they often consist of multiple independent units known as structural domains (defined as compact, semi-independent folding units). These domains have their own evolutionary story and often recur in different contexts within different multi-domain proteins. Accurately identifying the boundaries of these structural domains, i.e. recognising where each domain starts and stops, is therefore a crucial first step in subsequent analyses such as protein engineering, disease diagnostics and drug design. Providing accurate domain boundaries at scale is a non-trivial problem, however, and often requires expert manual curation guided by a variety of metrics from automated algorithms.
What we did
The Orengo group has used an automated data pipeline for many years to create and maintain the CATH database of protein structural domains [1]. Part of this pipeline runs the algorithms responsible for identifying protein structural domains from 3D data. However, the pipeline consisted of old code that was difficult to maintain, difficult to extend with more modern algorithms, and difficult to move to new compute facilities at UCL. Dr Robert Vickerstaff from ARC worked with Dr Ian Sillitoe from the Orengo group in UCL SMB to isolate the existing legacy code into portable modules that can be run as independent units of work, and to glue these modules together using a modern, well-tested workflow management framework (Nextflow). The aim was to give these data pipelines the flexibility and portability needed to be easily adopted both within HPC facilities at UCL and in the wider community as part of larger data workflows.
The impact
The main achievement of this Open Source Software Sustainability (OSSS) funded project was to build Nextflow data pipelines that can be run natively on the HPC facilities in both ARC and Computer Science with minimal changes to configuration. In addition, the existing Nextflow scripts were refactored to follow recommended best practices. This made the scripts more maintainable, easier to extend in the future and easier to share with the community. The project is still in active development and is available on GitHub [2].
Links
[2] https://github.com/UCLOrengoGroup/cath-alphaflow