UCL-CS Bioinformatics Introduction

Introduction

The Bioinformatics Group at University College London is headed by Professor David Jones, and was originally founded as the Joint Research Council funded Bioinformatics Unit within the Department of Computer Science at UCL. The Unit has now been fully integrated into the department as one of the 11 CS Research Groups. The group's main aim is to develop and apply state-of-the-art computational techniques to tackle problems arising in the life sciences, particularly those appearing in the post-genomic era. A particular emphasis of the group is on applications of machine learning techniques to biological problems. The group's interdisciplinary research is closely linked with the Institute of Structural and Molecular Biology, though we also maintain and encourage links to other UCL departments and Centres. The Group occupies dedicated space within the Department of Computer Science, along with space shared with the Faculty of Life Sciences, and makes full use of the available Departmental computing facilities, including a departmental HPC cluster of 1000 compute nodes with more than 300k GPU cores. The Group also maintains some dedicated computing facilities of its own to allow maintenance of specialized biological databases and public access to the software and methods developed within the Group.

Group Research

Graphic showing a simplified neural network being used to predict protein structures

Protein Structure Prediction

DMPfold2
DMPfold
S4Pred
PASS
PSIPRED
DISOPRED

S4Pred

S4PRED predicts the secondary structure of a protein from its single sequence alone, in the absence of homology information, achieving a Q3 score of 75.3% on the standard CB513 test set. Although they don't perform as well as their homology-based counterparts, single-sequence methods are not constrained by the requirement for evolutionary information. More accurate single-sequence approaches have the potential to improve structural modelling across the vast majority of sequence space, especially in areas of great scientific interest like viral proteins, the “dark proteome”, and de novo protein design. Academic users can download the model here.
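The Q3 score quoted above is the percentage of residues whose three-state secondary structure label (helix H, strand E, coil C) is predicted correctly. The following is a minimal sketch of that calculation; the function name `q3_score` and the example strings are illustrative and are not taken from the S4PRED code.

```python
def q3_score(predicted: str, observed: str) -> float:
    """Return the percentage of positions where the three-state labels agree."""
    if len(predicted) != len(observed):
        raise ValueError("predicted and observed strings must have equal length")
    if not observed:
        raise ValueError("cannot score an empty sequence")
    matches = sum(p == o for p, o in zip(predicted, observed))
    return 100.0 * matches / len(observed)


# Hypothetical 10-residue example in which 8 of the 10 states agree.
observed = "CHHHHCEEEC"
predicted = "CHHHCCEEEE"
print(q3_score(predicted, observed))  # prints 80.0
```

Benchmark figures such as the one for CB513 are typically reported over a whole test set rather than a single chain, so a real evaluation would aggregate this quantity across many proteins.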

DMPfold2

DMPfold2 is an ultrafast end-to-end deep learning method that predicts tertiary structure using only a multiple sequence alignment (MSA) as input. The network makes use of just three recurrent networks and a stack of residual convolutional layers, making the predictor very fast to run and easy to install and use. Our approach constructs a directly learned representation of the sequences in an MSA, starting from a one-hot encoding of the sequences. When supplemented with an approximate precision matrix, the learned representation can be used to produce structural models of comparable or greater accuracy than our original DMPfold method, while requiring less than a second to produce a typical model.
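As an illustration of the starting point described above, the sketch below one-hot encodes an MSA in plain Python. The 21-symbol alphabet (20 amino acids plus a gap), the channel ordering and the handling of unknown symbols are assumptions made for this example; DMPfold2's own encoding may differ.

```python
# Assumed alphabet: 20 standard amino acids plus a gap symbol.
# The alphabet and channel order actually used by DMPfold2 may differ.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"
INDEX = {aa: i for i, aa in enumerate(ALPHABET)}


def one_hot_msa(msa: list[str]) -> list[list[list[int]]]:
    """Encode an MSA as nested lists of shape (n_sequences, n_columns, 21)."""
    if not msa:
        raise ValueError("MSA is empty")
    width = len(msa[0])
    if any(len(seq) != width for seq in msa):
        raise ValueError("all aligned sequences must have the same length")
    encoded = []
    for seq in msa:
        rows = []
        for residue in seq.upper():
            vector = [0] * len(ALPHABET)
            # In this sketch, unknown symbols (e.g. X) share the gap channel.
            vector[INDEX.get(residue, INDEX["-"])] = 1
            rows.append(vector)
        encoded.append(rows)
    return encoded


# Hypothetical three-sequence alignment with five columns.
msa = ["MKV-L", "MRVAL", "MKI-L"]
tensor = one_hot_msa(msa)
print(len(tensor), len(tensor[0]), len(tensor[0][0]))  # prints 3 5 21
```

Each alignment position becomes a vector with a single 1, so the network receives the raw sequences without any hand-crafted profile features.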

DMPfold

DMPfold uses deep learning to predict inter-atomic distance bounds, the main chain hydrogen bond network, and torsion angles, which it uses to build models in an iterative fashion. DMPfold produces more accurate models than two popular methods for a test set of CASP12 domains, and works just as well for transmembrane proteins. It produced confident models for 25% of all Pfam domains without known structures and models for 16% of human proteome UniProt entries without structures, and generates accurate models with fewer than 100 sequences in some cases. Using the DMPfold method we have modelled all but ten of the proteins without templates in the JCVI-syn3.0 minimal genome. The paper is available here. A broader discussion of the use of deep learning in structural prediction is available here.

PASS: Profile Augmentation of Single Sequences

Profile Augmentation of Single Sequences (PASS) is a simple but powerful framework for accurately modelling single orphan protein sequences in the absence of homology information. S4PRED uses PASS to achieve an unprecedented Q3 score of 75.3% on the standard CB513 test set. PASS provides a blueprint for the development of a new generation of predictive methods, advancing our ability to model individual protein sequences.

PSIPRED: Predict Secondary Structure

PSIPRED is a simple and accurate secondary structure prediction method, incorporating two feed-forward neural networks which perform an analysis on output obtained from PSI-BLAST (Position-Specific Iterated BLAST). Using a very stringent cross validation method to evaluate the method's performance, PSIPRED 3.2 achieves an average Q₃ score of 81.6%. Predictions produced by PSIPRED were also submitted to the CASP4 evaluation and assessed during the CASP4 meeting, which took place in December 2000 at Asilomar. PSIPRED 2.0 achieved an average Q₃ score of 80.6% across all 40 submitted target domains with no obvious sequence similarity to structures present in PDB, which ranked PSIPRED top out of 20 evaluated methods (an earlier version of PSIPRED was also ranked top in CASP3, held in 1998). It is important to realise, however, that due to the small sample sizes, the results from CASP are not statistically significant, although they do give a rough guide as to the current "state of the art". For a more reliable evaluation, the EVA web site at Columbia University provides a continuous evaluation. Note that, at the time of writing, the EVA site is no longer being updated. Downloads: The PSIPRED V3.2 software can be downloaded from here. Please note that you should read the license terms given in the README file if you wish to incorporate PSIPRED in another program or Web server. Older releases of PSIPRED can be downloaded here.

DISOPRED3: Protein intrinsic disorder prediction

DISOPRED3 represents the latest release of our successful machine-learning based approach to the detection of intrinsically disordered regions. The method was originally trained on evolutionarily conserved sequence features of disordered regions from missing residues in high-resolution X-ray structures. DISOPRED2 mainly addressed the marked class imbalance between ordered and disordered amino acids as well as the different sequence patterns associated with terminal and internal disordered regions using SVMs. DISOPRED3 extends the previous architecture with two independent predictors of intrinsic disorder - a neural network and a nearest neighbor classifier - which were trained to identify long intrinsically disordered regions using data from the PDB and DisProt databases. The intermediate results are integrated by an additional neural network. DISOPRED3 was blindly tested and compared during the ninth and tenth rounds of the world-wide CASP experiments, where it was found to achieve high levels of specificity (about 99%) and therefore precision (about 75%). Indeed, the official assessment teams ranked DISOPRED3 at the top or near the top across a number of tests and evaluation measures. To provide insights into the biological roles of proteins, DISOPRED3 also predicts protein binding sites within disordered regions using an SVM that examines patterns of evolutionary sequence conservation, positional information and amino acid composition of putative disordered regions. Using a stringent test set, DISOPRED3 predictions were found to improve over existing methods, achieving approximately 20% precision and 30% recall. These results highlight the need for additional efforts in the area.
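To make the link between specificity and precision concrete, the sketch below computes both (plus recall) from confusion-matrix counts. The counts are hypothetical, chosen only so that the arithmetic lands on roughly 99% specificity and 75% precision; they are not CASP data, and the function name `binary_metrics` is illustrative.

```python
def binary_metrics(tp: int, fp: int, tn: int, fn: int) -> dict[str, float]:
    """Per-residue metrics, treating 'disordered' as the positive class."""
    return {
        "specificity": tn / (tn + fp),  # ordered residues correctly left alone
        "precision": tp / (tp + fp),    # predicted disorder that is real
        "recall": tp / (tp + fn),       # real disorder that was found
    }


# Hypothetical data set: 10,000 residues, of which 1,000 are disordered.
# Only 90 of the 9,000 ordered residues are wrongly called disordered.
metrics = binary_metrics(tp=270, fp=90, tn=8910, fn=730)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

Because ordered residues greatly outnumber disordered ones, even a 1% false positive rate among ordered residues produces a noticeable number of false positives, which is why precision depends so strongly on keeping specificity high.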

PDB structure 1J2P with 7 subunits coloured differently

Protein Domain Detection and Classification

Merizo

Merizo is a deep learning-based domain segmentation method that is trained on CATH annotations and can operate directly on PDB and AlphaFold2 structures. Notably, it makes use of AlphaFold2's Invariant Point Attention module in an input-reversed manner, using it to read a folded structure into a latent representation, where the residue embeddings are then clustered into domains in a bottom-up manner. The network learns to cluster residues into domains based on an affinity learning principle, whereby the latent representations of pairs of residues are encouraged or discouraged to be similar based on whether or not the residues belong to the same domain instance.
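The affinity learning principle can be sketched as follows: build a target matrix that is 1 for residue pairs in the same domain and 0 otherwise, then penalise embeddings whose pairwise similarity disagrees with it. The use of cosine similarity and a squared-error penalty here is an assumption for illustration only; the loss actually used by Merizo is not described on this page, and all function names are hypothetical.

```python
import math


def target_affinity(domain_ids: list[int]) -> list[list[int]]:
    """1 where two residues share a domain label, 0 otherwise."""
    return [[int(a == b) for b in domain_ids] for a in domain_ids]


def cosine(u: list[float], v: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def affinity_loss(embeddings: list[list[float]], domain_ids: list[int]) -> float:
    """Mean squared difference between pairwise similarity and the target."""
    target = target_affinity(domain_ids)
    n = len(domain_ids)
    total = 0.0
    for i in range(n):
        for j in range(n):
            total += (cosine(embeddings[i], embeddings[j]) - target[i][j]) ** 2
    return total / (n * n)


# Hypothetical 2-D embeddings for four residues forming two clear clusters.
embeddings = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
print(affinity_loss(embeddings, [0, 0, 1, 1]))  # labels agree with clusters
print(affinity_loss(embeddings, [0, 1, 0, 1]))  # labels disagree with clusters
```

The loss is zero when embeddings of residues in the same domain coincide and those in different domains are orthogonal, and it grows when the embedding clusters disagree with the domain labels.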

pGenTHREADER and pDomTHREADER: Fold Recognition

pGenTHREADER and pDomTHREADER are fast, sensitive methods for predicting protein fold structures by comparison with protein templates of known structure. Their speed makes them suitable for full proteome annotations. pDomTHREADER is an accurate and sensitive superfamily discrimination method, combining information from both sequence and structure to produce highly accurate domain alignments. The method employs the same underlying threading algorithm as pGenTHREADER; however, it aligns sequences to a domain-based template library rather than a chain-based template library. The use of smaller regions of structure for templates means that different features of the alignments are required for optimal scoring. The final prediction score results from an SVM trained on a combination of 5 different feature inputs: template coverage, alignment score, template length, solvation and pairwise potentials. Compared with other superfamily discrimination methods using Hidden Markov Models and PSI-BLAST profile alignments, we found that pDomTHREADER provided higher coverage on the CATH S35 superfamilies. Additionally, pDomTHREADER produced more accurate alignments that can be used to better predict domain boundaries. Please note that the pDomTHREADER method is tuned for performance in fine superfamily discrimination; for fold recognition problems or structural annotation of very distant sequences, pGenTHREADER should be used.

DomPred & DOMSSEA: Domain Boundary Prediction

DomPred is a protein structural domain boundary predictor. The DomPred process runs 2 independent protein domain predictors: DomPred and DOMSSEA. The DomPred process begins by using PSI-BLAST to match a database of Pfam-A domains to the query sequence; where no clear domains can be matched, it then proceeds to search the nrdb90 sequence database with PSI-BLAST. The final prediction is produced by analysing the locations of all the N and C boundaries for each hit. For the DOMSSEA process, predicted secondary structure patterns in the query sequence are matched to a library of SCOP domain secondary structure patterns.
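One way to picture the analysis of hit boundaries is as a histogram of where alignment hits start and end along the query, with a peak suggesting a domain boundary. The sketch below is only an illustration of that idea: the margin, the smoothing window, the function names and the example hits are all assumptions, and DomPred's actual scoring is not described here.

```python
def boundary_profile(query_length: int, hits: list[tuple[int, int]],
                     margin: int = 20) -> list[int]:
    """Count hit start/end positions (1-based) along the query.

    Termini within `margin` residues of either end of the query are ignored,
    since they more likely reflect the ends of the sequence than a boundary.
    """
    counts = [0] * (query_length + 1)  # index 0 is unused
    for start, end in hits:
        for pos in (start, end):
            if margin < pos <= query_length - margin:
                counts[pos] += 1
    return counts


def smooth(counts: list[int], window: int = 5) -> list[int]:
    """Sum counts in a sliding window centred on each position."""
    half = window // 2
    n = len(counts)
    return [sum(counts[max(0, i - half):min(n, i + half + 1)]) for i in range(n)]


def predicted_boundary(query_length: int, hits: list[tuple[int, int]],
                       margin: int = 20, window: int = 5):
    """Return the position with the most nearby hit termini, or None."""
    profile = smooth(boundary_profile(query_length, hits, margin), window)
    best = max(range(len(profile)), key=profile.__getitem__)
    return best if profile[best] > 0 else None


# Hypothetical hits (start, end) on a 200-residue query.
hits = [(1, 98), (3, 101), (100, 200), (102, 198), (99, 199)]
print(predicted_boundary(200, hits))  # prints 100
```

In this toy case the hits fall into two groups that meet around residue 100, so the smoothed profile peaks there; a hit spanning the whole query contributes no internal termini and yields no prediction.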

Protein Function Prediction

FFPRED
FFPRED-GAN
String2go

FFPRED: Protein Function Prediction

FFPred predicts Gene Ontology terms for human protein chains when homology with characterized proteins can provide little aid. Predictions are made by scanning the input sequences against an array of Support Vector Machines (SVMs). Each of these SVMs examines the relationship between protein function and biophysical attributes describing secondary structure, transmembrane helices, intrinsically disordered regions, signal peptides and other motifs. The latest version, FFPred3, features a larger SVM library that extends its coverage to the cellular component sub-ontology for the first time.

FFPRED-GAN: Generative Adversarial Networks

FFPRED-GAN provides a data augmentation method, enabling the generation of high-quality synthetic samples to improve the accuracy of function prediction methods. It uses a novel generative adversarial network to accurately learn the high-dimensional distributions of protein sequence-based biophysical features and generates high-quality synthetic protein feature samples, which can be used to augment the original training protein feature samples.

String2go

String2go predicts protein function using protein-protein interaction network data. It works by assuming that interacting proteins tend to have similar functions, and uses deep maxout neural networks to learn functional representations simultaneously encoding both protein-protein interactions and functional predictive information.
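For readers unfamiliar with maxout networks, a maxout unit outputs the maximum of several learned linear functions of its input instead of applying a fixed activation function. The sketch below shows a single unit in plain Python; the weights are made up for illustration and the function name `maxout_unit` is hypothetical, not part of String2go.

```python
def maxout_unit(x: list[float], weights: list[list[float]],
                biases: list[float]) -> float:
    """Return max over k linear pieces of (w_k . x + b_k)."""
    if len(weights) != len(biases):
        raise ValueError("need one bias per weight vector")
    return max(
        sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
        for w, b in zip(weights, biases)
    )


# With the two pieces x and -x, a maxout unit reproduces the absolute value.
print(maxout_unit([-3.0], weights=[[1.0], [-1.0]], biases=[0.0, 0.0]))  # prints 3.0

# Three pieces over a 2-D input: the largest linear response wins.
print(maxout_unit([1.0, -2.0],
                  weights=[[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]],
                  biases=[0.0, 0.5, 0.25]))  # prints 1.25
```

Because the pieces are learned, each unit can approximate a different convex activation shape rather than being tied to one fixed non-linearity.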

Protein Contact Analysis

DeepCov

DeepCov is a deep neural network-based method for predicting residue-residue contacts in protein sequences. It uses fully convolutional neural networks to analyze amino-acid pair frequency or covariance data derived directly from sequence alignments, without using global statistical methods. The method achieves state-of-the-art precision in contact prediction, matching or exceeding the performance of other methods such as CCMpred and MetaPSICOV2. DeepCov's performance is particularly notable in shallow sequence alignments (fewer than 160 effective sequences), indicating that it could be useful for predicting contacts in smaller sequence families.
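The pair frequency and covariance inputs mentioned above can be illustrated with a small calculation. For two alignment columns i and j, the covariance for residue types a and b is f_ij(a, b) - f_i(a) f_j(b), where f denotes observed frequencies. The sketch below omits sequence weighting and any other preprocessing a real pipeline would apply; the alphabet and function names are assumptions for this example.

```python
from collections import Counter
from itertools import product

# Assumed alphabet: 20 amino acids plus gap; other symbols are ignored here.
ALPHABET = "ACDEFGHIKLMNPQRSTVWY-"


def column_frequencies(msa: list[str], i: int) -> dict[str, float]:
    """Frequency of each symbol in alignment column i."""
    counts = Counter(seq[i] for seq in msa)
    return {a: counts[a] / len(msa) for a in ALPHABET}


def pair_frequencies(msa: list[str], i: int, j: int) -> dict[tuple[str, str], float]:
    """Joint frequency of each symbol pair in columns i and j."""
    counts = Counter((seq[i], seq[j]) for seq in msa)
    return {(a, b): counts[(a, b)] / len(msa)
            for a, b in product(ALPHABET, repeat=2)}


def pair_covariance(msa: list[str], i: int, j: int) -> dict[tuple[str, str], float]:
    """Covariance f_ij(a, b) - f_i(a) * f_j(b) for every symbol pair."""
    f_i = column_frequencies(msa, i)
    f_j = column_frequencies(msa, j)
    f_ij = pair_frequencies(msa, i, j)
    return {(a, b): f_ij[(a, b)] - f_i[a] * f_j[b] for (a, b) in f_ij}


# Hypothetical alignment: columns 0 and 1 co-vary, column 2 is invariant.
msa = ["AKL", "AKL", "DEL", "DEL"]
cov01 = pair_covariance(msa, 0, 1)
cov02 = pair_covariance(msa, 0, 2)
print(cov01[("A", "K")], cov01[("A", "E")], cov02[("A", "L")])
```

Pairs that occur together more often than their individual frequencies predict receive positive values, pairs that avoid each other receive negative values, and an invariant column carries no covariance signal.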

HSPRED: Protein-Protein Interaction Characterization

Protein-protein interactions are critically dependent on just a few ‘hot spot’ residues at the interface. These hot spots make a dominant contribution to the free energy of binding and they can disrupt the interaction if mutated to alanine. HSPred is a support vector machine (SVM)-based method which predicts hot spot residues, given the PDB structure of a complex.

DMP: DeepMetaPSICOV for Inter-Residue Contact Prediction

DeepMetaPSICOV (abbreviated DMP) is an inter-residue contact predictor based on a deep, fully convolutional residual network and a large input feature set. Data was augmented by including partial loop deletions, simulated by probabilistically removing rows and columns in the input tensors and contact maps; by linearly interpolating input tensors generated using different alignments; and by flipping input feature tensors and contact maps by 180°, corresponding to a reversal of the chain direction.
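The equivalence between a 180° flip of a contact map and a reversal of the chain direction can be checked directly. In the sketch below, the coordinates are made up and the 8 Å distance cutoff is an assumption (a commonly used convention, not a value stated here); the function names are illustrative.

```python
import math


def contact_map(coords: list[tuple[float, float, float]],
                cutoff: float = 8.0) -> list[list[int]]:
    """Binary map: 1 where two residues are closer than `cutoff` angstroms."""
    n = len(coords)
    return [[int(math.dist(coords[i], coords[j]) < cutoff) for j in range(n)]
            for i in range(n)]


def rotate_180(matrix: list[list[int]]) -> list[list[int]]:
    """Rotate a square matrix by 180 degrees (reverse rows and columns)."""
    return [row[::-1] for row in matrix[::-1]]


# Hypothetical coordinates for a six-residue chain.
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0),
          (7.6, 3.8, 0.0), (3.8, 3.8, 0.0), (20.0, 20.0, 20.0)]

original = contact_map(coords)
flipped = rotate_180(original)
reversed_chain = contact_map(coords[::-1])
print(flipped == reversed_chain)  # prints True
```

Since element (i, j) of the rotated map equals element (n-1-i, n-1-j) of the original, the flipped map is exactly the map the same structure would give if its residues were numbered from the other end.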

Metsite: Metal Binding Site Prediction

Metsite predicts residues forming the metal-binding site in superfamilies by combining sequence profile and structural information. It uses neural networks to predict six commonly occurring metal ion sites: Ca2+, Cu2+, Fe3+, Mg2+, Mn2+ and Zn2+.

De Novo Protein Design

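DARK: Learning Deep Generative Models for De Novo Protein Design
Protein-VAE: Design of Metalloproteins & Novel Protein Folds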

Graphic of some proteins embedded in a lipid bilayer.

Transmembrane Protein Modelling

MEMSAT & MEMSATSVM

MEMSAT V3 is a widely used prediction method for all-helical membrane proteins, based on the original MEMSAT method. The MEMSAT algorithms calculate the most probable length, location and topological orientation for each transmembrane segment. They make use of scores compiled from membrane protein data about the propensity of each amino acid to be in one of five states (inside loop, outside loop, inside helix end, helix middle and outside helix end) and a dynamic programming algorithm to search through all possible topological models by a process of expectation maximization. The method was benchmarked on a test set of transmembrane proteins of known topology. From sequence data, MEMSAT was estimated to have an accuracy of over 78% at predicting the structure of all-helical transmembrane proteins and the location of their constituent helical elements within a membrane.

MEMSATSVM is a highly accurate predictor of transmembrane helix topology. It is capable of discriminating signal peptides and identifying the cytosolic and extra-cellular loops. The original paper is available here.

MEMPACK: Transmembrane helix contact prediction

MEMPACK is a membrane helix packing predictor. The process leverages MEMSATSVM predictions to predict possible inter-helix interactions. In the final step, a helix packing arrangement is produced that orients the helices such that the greatest number of predicted interactions face one another.

Memembed

Memembed, developed in 2013, provides code that combines a knowledge-based membrane potential, calculated from a statistical analysis of transmembrane protein structures, with genetic and direct search algorithms. Its use has been demonstrated in positioning proteins in membranes, estimating membrane thickness, refining transmembrane protein models and decoy discrimination.

Collaborations

The future of the PSIPRED workbench

Key collaborator: Dr Daniel Buchan

Development of the PSIPRED workbench is an ongoing collaboration with Dr Daniel Buchan of Goldsmiths, University of London.

Genome3D

Genome3D is a freely available resource that provides consensus structural annotations for representative protein sequences taken from a selection of model organisms. The group provides 3D Models based on DomSerf and domain predictions based on pDomTHREADER.

CATH

Lead researcher: Professor Christine Orengo

CATH is a classification of protein structures downloaded from the Protein Data Bank. Protein domains are grouped into superfamilies when there is sufficient evidence they have diverged from a common ancestor.

ELIXIR

ELIXIR coordinates and develops life science resources across Europe so that researchers can more easily find, analyse and share data, exchange expertise, and implement best practices. This makes it possible for them to gain greater insights into how living organisms work. PSIPRED is an ELIXIR-UK service.

ISMB: Institute of Structural and Molecular Biology

Key collaborators: Dr Renós Savva and Dr Cen Wan

ISMB was founded in 2003 to promote multi-disciplinary research at the interface of structural, computational and chemical biology at UCL and Birkbeck. The Jones lab is an active member of ISMB, and collaborates with Dr Renós Savva on protein design, and Dr Cen Wan on functional prediction.
