Crime scene DNA analysis: assessing the evidence

The Research Software Development Group assisted Prof David Balding of the UCL Genetics Institute with his forensic DNA analysis package, which is used to assess the weight of evidence from partial DNA samples.

21 August 2015

Background

LikeLTD is a software package, originally written by Professor Balding, for interpreting DNA profiles obtained from very small samples found at crime scenes. DNA evidence, it turns out, is not always clear-cut. Some of the key markers that forensic scientists look for may have degraded badly and become unrecognisable. Alternatively, the sample may be a mixture of DNA from different people, some of whom may not have been identified. The software addresses these problems by calculating the likelihood of the evidence under two competing hypotheses, one for the defence: "It was not the suspect, I say, but some unknown person!", and one for the prosecution: "No, sir. Indeed it was the suspect!". The verdict from the DNA evidence becomes the ratio of the likelihoods of the two hypotheses.
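As a minimal sketch of the ratio described above, assuming the two likelihoods have already been computed elsewhere: the function name below is hypothetical and is not part of LikeLTD. Putting the prosecution likelihood in the numerator follows the usual forensic convention rather than anything stated in this article.

```c
#include <assert.h>

/* Hypothetical helper, not LikeLTD's API.
   lik_prosecution: likelihood of the evidence if the suspect contributed.
   lik_defence:     likelihood of the evidence if an unknown person did. */
double likelihood_ratio(double lik_prosecution, double lik_defence)
{
    /* The ratio is undefined if the defence likelihood is zero. */
    assert(lik_defence > 0.0);
    return lik_prosecution / lik_defence;
}
```

A ratio above 1 means the evidence is more probable under the prosecution hypothesis, and a ratio below 1 favours the defence hypothesis.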

The Research Software Development Group (RSDG) were asked to get involved with this project in order to tackle two main objectives:

Clean up the code, giving it a coherent and easy-to-follow structure so that it can be submitted to the Comprehensive R Archive Network (CRAN) - a repository for source code and software using the R programming language.
Improve the overall speed of execution.

What we did

RSDG began their work on this project by meeting with Professor Balding's group and getting to grips with the scientific context of LikeLTD. Regular meetings over the course of the three-month project ensured a genuinely collaborative approach so that the researchers were aware of, and had input into, each stage of development.

As a first step, the code was rewritten in R with a modular structure. This made it clearer what happens at each stage of the analysis and made it easier to improve individual modules in future. This groundwork enabled the RSDG team to increase the speed of execution by improving the performance of individual modules.

Gains in performance were made by rewriting some of the modules in C, a compiled language that generally runs faster than interpreted R code. This also allowed the code to be parallelised by adding OpenMP directives, in order to take advantage of systems with multiple cores.
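The kind of loop-level parallelism that OpenMP provides can be sketched as follows. This is an illustration only, not code from LikeLTD: the function names and the per-item calculation are hypothetical stand-ins, and the sketch assumes the loop iterations are independent of one another.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for an expensive per-item calculation;
   LikeLTD's real computations are not shown in this article. */
static double term_weight(double x)
{
    return x * x;
}

/* Sum independent terms, sharing loop iterations between threads.
   Build with -fopenmp (GCC/Clang) to enable the directive; without
   the flag it is ignored and the loop simply runs serially. */
double sum_terms(const double *values, long n)
{
    assert(values != NULL || n == 0);
    double total = 0.0;
    /* Each thread accumulates a private partial sum; OpenMP then
       combines the partial sums into total when the loop ends. */
    #pragma omp parallel for reduction(+:total)
    for (long i = 0; i < n; ++i) {
        total += term_weight(values[i]);
    }
    return total;
}
```

Because the directive is a single line above an otherwise ordinary loop, the same source builds as serial or parallel code depending on the compiler flag.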

Following an analysis of one of the principal optimisation algorithms used in LikeLTD, the RSDG team were also able to suggest an alternative, more efficient method with the potential to generate further gains in performance compared to the original code. The modular organisation of the new code makes it much easier for researchers to try out the many different optimisation methods available in the public domain and find which one is best adapted for their particular problem.

The result was a well-documented and thoroughly tested set of modules which can be installed as a package in R, ready for submission to CRAN, the online repository for sharing R packages. Professor Balding's team were presented with a comprehensive report at the end of the project, comprising figures and analysis of the improvements made.

Results / impact of the work

"I think I will learn a lot from your coding. Very clearly laid out and easy to follow, not just the comments but the physical layout of code blocks, and the use of self-explanatory objects that utilise obvious English words." - Adrian Timpson, LikeLTD author

Figure: Speed-up with respect to the original LikeLTD code for a range of different problem sizes (the number of alleles under consideration). Parallelisation of the code (adding more threads) produces further gains in performance - more than 30x in some cases.

The hope is that the improvements made in terms of code structure, testing and documentation will make it much easier for researchers to collaborate and build upon the existing code in future.

The improvements in speed of execution have had a significant impact on workflow for users. The initial set-up stage of the analysis, which could take up to 45 minutes, is now typically completed in seconds. In the second stage of analysis, particularly large jobs that could take several days are now typically completed within two to three hours. Submission of LikeLTD into the CRAN repository will increase the profile of this work and should lead to the establishment of this software as an essential tool for forensic DNA analysis around the world.
