SLMS Academic Careers Office

24. Large-scale phylogenomic mapping of domain architecture changes to elucidate gene function evolution

Supervisor Pair: Dr Christophe Dessimoz and Professor Christine Orengo
Potential Student’s Home Department: Genetics, Evolution and Environment

A powerful way of characterising newly sequenced genes is to compare them with evolutionarily related genes. Gene-centric phylogenomic databases such as OMA, developed in the primary supervisor’s lab, elucidate the evolutionary history of proteins in terms of two key evolutionary processes: speciation events (which yield orthologous genes) and duplication events (which yield paralogous genes). However, by focusing on full-length proteins, these databases and associated methods largely ignore domain architecture changes (such as gene fusion, fission and domain shuffling), thus failing to account for a major source of protein function innovation and adaptation.

In this project, we will develop a new method to systematically identify domain architecture changes and infer where they occurred in the evolutionary histories of the gene families involved. In this way, new links reflecting architecture changes will be established between orthologous groups that are currently disconnected. In turn, this will pave the way for improvements in protein function propagation and will provide a more comprehensive and integrated framework to study protein evolution across thousands of species.
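
To make the kinds of architecture change mentioned above concrete, the following is a minimal toy sketch, not the method the project will develop. It assumes each gene is represented as an ordered tuple of domain labels; the labels "A", "B", "C" and the function name classify_change are hypothetical and used only for illustration.

```python
from typing import Sequence, Tuple

# Assumption: a domain architecture is an ordered tuple of domain labels.
Architecture = Tuple[str, ...]


def classify_change(ancestral: Sequence[Architecture],
                    descendant: Sequence[Architecture]) -> str:
    """Classify a toy architecture change between ancestral and descendant genes."""
    # Fusion: two ancestral genes are joined end to end into one descendant gene.
    if len(ancestral) == 2 and len(descendant) == 1:
        a, b = ancestral
        if descendant[0] in (a + b, b + a):
            return "fusion"
    # Fission: one ancestral gene is split into two descendant genes.
    if len(ancestral) == 1 and len(descendant) == 2:
        a, b = descendant
        if ancestral[0] in (a + b, b + a):
            return "fission"
    # One-to-one comparison: same domains in a different order is shuffling.
    if len(ancestral) == 1 and len(descendant) == 1:
        before, after = ancestral[0], descendant[0]
        if before == after:
            return "unchanged"
        if sorted(before) == sorted(after):
            return "shuffling"
    # Anything else (domain gain, loss, more complex events) is left unclassified.
    return "other"


fusion_example = classify_change([("A",), ("B", "C")], [("A", "B", "C")])
fission_example = classify_change([("A", "B")], [("A",), ("B",)])
shuffling_example = classify_change([("A", "B", "C")], [("C", "A", "B")])
```

A real analysis would additionally need to place each event on the gene trees of the families involved and to cope with uncertain domain assignments, neither of which this sketch attempts.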

Understanding how new domain architectures evolve and modify the functions of proteins will be extremely valuable in function prediction protocols. Since for most organisms, even human, fewer than 10% of genes have detailed experimental characterisation, understanding how protein function diverges will benefit prediction algorithms. The Orengo group have expertise in function prediction methods and were ranked 7th (out of 56 methods) in a recent international assessment (CAFA) of prediction methods (Nature Methods, 2013).

The project builds upon the considerable relevant experience and infrastructure of the supervising team, including the OMA orthology database and the CATH protein domain database—two leading resources in their respective areas.