Centre for Research on Evolution, Search and Testing (CREST)


The 63rd CREST Open Workshop - Genetic Improvement and Software Specialisation

27 March 2023–28 March 2023, 10:00 am–5:00 pm

Event Information


Sold out


Dr Justyna Petke, Prof Federica Sarro, Prof William Langdon, Dr Giovani Guizzo, James Callan, CREST Centre, SSE Group, Department of Computer Science, UCL, UK


Room 103
Engineering Front Building
Torrington Place

The topic of this workshop is Genetic Improvement and Software Specialisation. Genetic Improvement (GI) uses automated search to improve existing software. Beyond bug fixing, GI has been used to optimise software properties such as runtime, memory and energy consumption, and to specialise software for particular application domains. Given GI's use in improving software performance through specialisation, we invite experts in related areas, e.g., software performance, compiler optimisation, parameter tuning, and optimisation for particular application domains. The goal of this workshop is to reflect on the progress the research community has made in recent years, to share experience in research and deployment, and to identify key challenges that need to be addressed in future research.
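As a rough illustration of the GI loop described above (mutate, test, keep improvements), here is a toy sketch. Everything in it is invented for illustration: the statement-list "program" representation, the tiny test suite, and the delete/swap mutation operators stand in for the source-level transformations and benchmarks real GI systems use.

```python
import random

# Toy genetic-improvement loop (illustrative only: the "program"
# representation, test suite, and mutation operators are all invented).
# The program is a list of Python statements; mutations delete or swap
# statements, and a variant is kept only if it still passes the tests
# and has fewer statements (a crude stand-in for runtime).

SEED_PROGRAM = [
    "total = 0",
    "unused = [0] * 1000",         # dead code GI can remove
    "for x in data: total += x",
    "unused2 = sorted(data) * 2",  # more dead code
    "result = total",
]

def run(program, data):
    env = {"data": list(data)}
    for stmt in program:
        exec(stmt, {}, env)        # env acts as the program's namespace
    return env["result"]

def passes_tests(program):
    tests = [([1, 2, 3], 6), ([], 0), ([5], 5)]
    try:
        return all(run(program, d) == expected for d, expected in tests)
    except Exception:
        return False               # a crashing variant fails the suite

def mutate(program):
    child = list(program)
    i = random.randrange(len(child))
    if random.random() < 0.5 and len(child) > 1:
        del child[i]                              # DELETE mutation
    else:
        j = random.randrange(len(child))
        child[i], child[j] = child[j], child[i]   # SWAP mutation
    return child

random.seed(0)
best = SEED_PROGRAM
for _ in range(200):
    child = mutate(best)
    if passes_tests(child) and len(child) < len(best):
        best = child               # correct and smaller: accept the patch

print(len(SEED_PROGRAM), "->", len(best))
```

The search keeps only variants that both pass the test suite and reduce the size proxy, so the dead statements are evolved away while behaviour on the tests is preserved; real GI systems replace the size proxy with measured runtime, memory or energy.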


Day 1 - 27th March  2023

10:00 Pastries

10:30 Welcome and Introductions

11:00 Prof. Myra Cohen, Iowa State University

Keeping Secrets: A Journey in Multi-Objective Genetic Improvement

Information leaks in software can unintentionally reveal private data, yet they are hard to detect and fix. While there are some static approaches to finding leaks, they require specialist knowledge and may not guide the developer towards a successful patch. At the same time, dynamic detection and repair remains elusive. In this talk I discuss our journey to design a genetic improvement approach for reducing information leakage. Our solution, LeakReducer, first leverages hyper-testing to find and quantify unwanted flows of information, and then repairs the leakage via genetic improvement. Along the way, we learned that (a) traditional security testing approaches like fuzzing may be unsuited to finding leakage, and (b) program correctness without context is subjective. In fact, I argue that genetic improvement is fundamentally a multi-objective endeavor in which it may not be possible to satisfy both functional and non-functional objectives as specified. This leads to the ultimate question of what it means for a program to be correct in the context of genetic improvement.

More About The Speaker

11:30 Prof. Bill Langdon, University College London

GI success stories and what the future holds

Software is, and will remain, the dominant technology of the third millennium. Without programs, computer hardware is useless. The world is increasingly automated, but despite many advances in software engineering tools, the production and maintenance of software remains labour intensive. The dream of totally automated programming remains distant, but search-based optimisation is increasingly being applied to improve existing software.

GI has been demonstrated to give a 70-fold speed-up on a particular task for a state-of-the-art bioinformatics tool (Bowtie2). A GPU-based tool, BarraCUDA, was the first to accept GI changes into production, and the automatically optimised code changes have been downloaded many thousands of times. Another state-of-the-art tool, RNAfold, has been evolved in several ways, including speeding up code using parallel hardware and tuning data parameters to give more accurate answers. Again, these GI changes have been accepted into the standard release and downloaded many thousands of times, including by people generating better RNA probes to detect COVID-19.

Software is an engineering material. It is plastic and robust. It can be reformed and re-used in many ways. Sometimes things go wrong, but in many cases software can be, and is, used and delivers economic benefits despite containing many errors (bugs). We have started using information theory to give insights into why hand-written software is in practice robust, the difficulty of testing, and the implications for test oracle placement. The average impact of errors and runtime perturbations falls exponentially with nesting depth, so even increasing the number of IID test cases increases the visibility of disruptions only logarithmically.

Almost all software is written in high-level languages, and genetic improvement is mostly applied directly to human-written program source code. However, researchers have shown that Java byte code, assembly code and even binary machine code can also be automatically evolved. Recent experiments (EuroGP 2023) have shown that GI can improve LLVM intermediate representation (IR), even exceeding compiler optimisations, for two industrial open-source programs from Google (OLC) and Uber (H3).

More About The Speaker

12:00 Lunch

13:00 Dr. Markus Wagner, Monash University

CryptOpt and Socialz — search tools for specialised assembly code and for diverse community interaction

In this presentation, I will ever so briefly outline two projects: (1) CryptOpt (https://arxiv.org/abs/2211.10665) is the first compilation pipeline that specialises high-level cryptographic functional programs into assembly code significantly faster than what GCC or Clang produce, with mechanised proof (in Coq). We apply randomised search through the space of assembly programs, with repeated automatic benchmarking on target CPUs. The overall prototype is quite practical, e.g. producing new fastest-known implementations of finite-field arithmetic for both Curve25519 (part of the TLS standard) and the Bitcoin elliptic curve secp256k1 on the relatively new Intel i9 12G. (2) Socialz (https://arxiv.org/abs/2302.08664) aims to provide anyone with the capability to perform comprehensive social testing, thereby improving the reliability and security of online social networks used around the world. Socialz is a novel approach to social fuzz testing that (i) characterises real users of a social network, (ii) diversifies their interaction using evolutionary computation across multiple, non-trivial features, and (iii) collects performance data as these interactions are executed.

More About The Speaker

13:30 Dr. Oliver Krauss, University of Applied Sciences Upper Austria

Pattern Mining and Genetic Improvement in Compilers and Interpreters

Source code can be improved through the process of genetic improvement, which involves creating numerous variants of the same software. By using pattern mining, we can identify recurring patterns in the code that are responsible for non-functional properties or bugs. In this talk, we'll explore ways to identify patterns in software variants and apply them in genetic improvement to optimize a software's runtime performance.

More About The Speaker

14:00 Dr. Luca Traini, University of L'Aquila

Towards Effective Java Performance Evaluation: Are we there yet?

Performance evaluation is a crucial activity in modern Software Engineering (SE). Software development processes rely on performance evaluation to assess the impact of software revisions, and state-of-the-art SE techniques, including Genetic Improvement (GI), incorporate performance evaluation as a key step of their methodologies. As a consequence, inadequate performance evaluation can hinder the release velocity of software development processes, or inadvertently introduce performance issues in production. Similarly, the effectiveness of state-of-the-art SE techniques, such as GI, can be severely affected by a suboptimal performance evaluation. Unfortunately, conducting effective performance evaluation presents daunting challenges, especially when assessing software subject to just-in-time compilation, such as Java software. In this regard, our recent empirical study on Java performance evaluation exposes potential pitfalls and shortcomings in current practices and state-of-the-art techniques. Our findings demonstrate that such approaches often fall short in providing effective performance evaluation, leading to significant consequences, including prolonged execution times and inaccurate results. This is a joint work with Vittorio Cortellessa, Daniele Di Pompeo and Michele Tucci. Open-Access article: https://doi.org/10.1007/s10664-022-10247-x

More About The Speaker

14:30 Tea/coffee break

15:00 Dr. Sandy Brownlee, University of Stirling

Lost weekends with Gin: recent updates and results for the Java GI toolbox

Gin is a Java-based toolbox for experimentation in genetic improvement of software, first released in 2017. This presentation will share some recent developments with Gin designed to keep it up to date with modern Java. We will also cover some preliminary results exploring the search space associated with several different approaches to modifying code.

More About The Speaker

15:30 Dr. Aymeric Blot, Université du Littoral Côte d'Opale

Magpie: Machine Automated General Performance Improvement via Evolution of Software

Magpie is a powerful tool for automated software improvement that provides a unified framework for exploring the space of possible software improvements. With its support for program transformations, parameter tuning, and compiler optimizations, Magpie enables researchers and software engineers to improve both the functional and non-functional properties of software. In this presentation, we will introduce Magpie and demonstrate its capabilities and effectiveness through examples and case studies. We will also discuss how Magpie can simplify the software improvement process by isolating the search process from the specific improvement technique. Overall, this presentation will provide an introduction to Magpie and demonstrate how it can be used to improve software in various domains. The framework is freely available online at https://github.com/bloa/magpie

More About The Speaker

16:00 Discussion/Breakout

17:00 Close


Day 2 - 28th March  2023

10:00 Pastries

10:30 Dr. David Clark, University College London

Hypertesting software

Hypertesting is a form of metamorphic testing that can be used to check whether software conforms to a policy, or to discover the policy the software does conform to. Since it is a form of testing, the traditional caveat applies: there are no guarantees, and exploration discovers what it discovers. Software syntax coverage is, of course, useful but inadequate, so test sets need to be extended with high-information-content tests to improve exploration.

In our approach, policies and properties are necessary but can be aimed at anything expressible as a hyperproperty: e.g. confidentiality, integrity, privacy, and opacity properties. A failing hypertest can be viewed as a witness to the failure of the flow property in the context of the software's policy.

Recent research into hypertesting of confidentiality properties for programs has demonstrated that hypertests together with GI can identify, measure and assist in the automated repair of confidentiality leaks. Rather than noninterference, a more natural security property to test against (and repair) might be Myers and Liskov's Decentralized Label Model, where policies are similar to capabilities and hence individual rather than global.

More About The Speaker

11:00 Dr. Serkan Kirbas, Bloomberg

Automatic Program Repair in Industry: Findings from Bloomberg

During this session, Serkan will share observations and findings from Automatic Program Repair (APR) work at Bloomberg. He will focus on software engineers’ experience and the practical aspects of getting automatically generated code changes accepted and used in industry. Furthermore, he will discuss the results of qualitative research at Bloomberg, demonstrating the importance of the timing and presentation of fixes.

11:30 Prof. Mark Harman, Meta/University College London

Software Improvement Research Challenges: An Industrial Perspective

There have been rapid recent developments in automated software test design, repair and program improvement. For search-based approaches to specialisation and improvement, advances in automated software testing go hand-in-hand with advances in improvement. Advances in artificial intelligence also have great potential to tackle software engineering automation problems, including specialisation and improvement. In this talk I will highlight open research problems and challenges from an industrial perspective. This perspective draws on experience at Meta Platforms, which has been actively involved in software engineering research and development for over a decade. There are many exciting opportunities for research to achieve the widest and deepest impact on software practice. With this talk, I want to engage with the scientific community, especially on problems in search-based automated software improvement for performance optimisation. This talk is partly based on a forthcoming invited paper and keynote talk at the International Conference on Software Testing (ICST 2023). The ICST keynote talk will be given by Nadia Alshahwan and Mark Harman. The paper is by Nadia Alshahwan, Mark Harman and Alexandru Marginean. Thanks to the many Meta engineers, managers and leadership for their assistance, support and work on deploying automated software testing and improvement techniques.

More About The Speaker

12:00 Lunch

13:00 Dr. Vesna Nowack, Imperial College London

Human Factors in Automatically Generated Software

Recently we have seen a significant increase in the number of tools that generate software and help developers in programming. Adopting these tools could change software developers’ daily activities and transform their work practices. For example, Automatic Program Repair (APR) generates bug fixes by applying different techniques (like genetic improvement) and might reduce the manual effort of fixing bugs. To understand the benefits of APR, it is vital that we consider how software engineers feel about APR and the impact it may have on developers’ work.

In this talk, I will show our analysis of human factors in 260 articles in APR literature to understand how developers are considered in APR research. Over half of the reviewed articles were motivated by a problem faced by developers, but fewer than 7% of the reviewed articles included a human study. Our results suggest that software developers are often talked about in APR literature, but are rarely talked with. 

To understand developers' general attitudes to APR or developers' current bug-fixing practices, we carried out a survey of 386 software developers. Our findings show that developers derive satisfaction and benefit from bug fixing and that they prefer being kept in the loop (for example, choosing between different fixes or validating fixes) as opposed to a fully automated process. This suggests that APR should consider the involvement of developers, as well as what information is presented to developers alongside fixes.

More About The Speaker

13:30 James Callan, University College London

Improving the Non-Functional Properties of Android Apps with Genetic Improvement

Due to the limited hardware on which Android applications generally run, non-functional properties are particularly important for both developers and users. While Genetic Improvement (GI) has been shown to improve non-functional properties in traditional desktop domains, it has only rarely been applied to Android apps. This talk will present both the successes and failures we have had in applying GI in the Android domain, attempting to improve the frame rate, responsiveness, execution time, memory consumption, and bandwidth usage of apps. The talk will also explore the challenges faced when applying GI to Android apps compared with the traditional domain, and the largest problems still faced when applying GI to Android.

More About The Speaker

14:00 Dr. Giovani Guizzo, University College London

SE4AI: Using Regression Test Selection to speed-up GI

Although GI has been proven effective in improving functional and non-functional properties of software, it still demands a great deal of computational resources. The main reason for GI's elevated cost is the execution of (potentially thousands of) test cases to validate new patches. In this presentation, I will show how we used a classic SE technique, Regression Test Selection (RTS), to improve the execution time of GI by selecting only the subset of relevant test cases that can potentially reveal bugs in the patched code. This presentation shows a fine example of how SE can be applied to AI in order to solve problems, not only the other way around.
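The core idea behind coverage-based regression test selection can be sketched in a few lines. This is a toy illustration, not the technique from the talk: the coverage map, test names and line numbers below are all invented.

```python
# Minimal coverage-based regression test selection (RTS) sketch.
# Illustrative only: the coverage map and test names are invented.

coverage = {                    # test name -> source lines it executes
    "test_parse":  {10, 11, 12},
    "test_eval":   {20, 21, 22, 23},
    "test_print":  {30, 31},
    "test_errors": {12, 23, 31},
}

def select_tests(changed_lines, coverage):
    """Keep only tests whose covered lines overlap the patched lines."""
    return sorted(t for t, lines in coverage.items() if lines & changed_lines)

# A GI patch touching lines 21-22 needs just one of the four tests:
print(select_tests({21, 22}, coverage))   # ['test_eval']
```

Because a GI patch typically touches a tiny fraction of the program, only the tests whose coverage intersects the patched lines can change outcome, so the rest can safely be skipped when evaluating each candidate patch.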

More About The Speaker

14:30 Tea/coffee break

15:00 Dr. Maria Kechagia, University College London

Green AI: Do Deep Learning Frameworks Have Different Costs?

The use of Artificial Intelligence (AI), and more specifically of Deep Learning (DL), in modern software systems is nowadays widespread and continues to grow. At the same time, its usage is energy-demanding, contributes to increased CO2 emissions, and has a great financial cost as well. Even though there are many studies that examine the capabilities of DL, only a few focus on its green aspects, such as energy consumption. This talk aims at raising awareness of the costs incurred when using different DL frameworks. To this end, we perform a thorough empirical study to measure and compare the energy consumption and runtime performance of six different DL models written in the two most popular DL frameworks, namely PyTorch and TensorFlow. We use a well-known benchmark of DL models, DeepLearningExamples, created by NVIDIA, to compare both the training and inference costs of DL. Finally, we manually investigate the functions of these frameworks that took most of the time to execute in our experiments. The results of our empirical study reveal that there is a statistically significant difference between the costs incurred by the two DL frameworks in 94% of the cases studied. While TensorFlow achieves significantly better energy and runtime performance than PyTorch, with large effect sizes in 100% of the cases for the training phase, PyTorch instead exhibits significantly better energy and runtime performance than TensorFlow in the inference phase for 66% of the cases, always with large effect sizes. Such a large difference in performance costs does not, however, seem to affect the accuracy of the models produced, as both frameworks achieve comparable scores under the same configurations. Our manual analysis of the documentation and source code of the functions examined reveals that such a difference in performance costs is under-documented in these frameworks. This suggests that developers need to improve the documentation of their DL frameworks and the source code of the functions used in these frameworks, as well as to enhance existing DL algorithms.

More About The Speaker

15:30 Dr. Max Hort, Simula Research Laboratory

Multi-objective search for gender-fair and semantically correct word embeddings

Fairness is a crucial non-functional requirement of modern software systems that rely on the use of Artificial Intelligence (AI) to make decisions regarding our daily lives in application domains such as justice, healthcare and education. In fact, these algorithms can exhibit unwanted discriminatory behaviours that create unfair outcomes when the software is used, such as giving privilege to one group of users over another (e.g., males vs. females). Mitigating algorithmic bias during the development life cycle of AI-enabled software is crucial given that any bias in these algorithms is inherited by the software systems using them. However, previous work has shown that mitigating bias can impact the performance of such systems. Therefore, we propose herein a novel use of soft computing for improving AI-enabled software fairness. Specifically, we exploit multi-objective search, as opposed to previous work optimising fairness only, to strike an optimal balance between reducing gender bias and improving semantic correctness of word embedding models, which are at the core of many AI-enabled systems.
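The multi-objective search described above keeps a set of trade-off solutions rather than a single optimum. A minimal sketch of the underlying Pareto-dominance idea is shown below; the candidate scores are invented for illustration, and both objectives are treated as maximised.

```python
# Illustrative Pareto-dominance filter for two objectives such as those
# in the talk (bias reduction, semantic correctness). Scores are invented.

def dominates(a, b):
    """a dominates b: no worse in every objective, better in at least one."""
    return all(x >= y for x, y in zip(a, b)) and any(x > y for x, y in zip(a, b))

def pareto_front(points):
    """Keep the candidates that no other candidate dominates."""
    return [p for p in points
            if not any(dominates(q, p) for q in points if q != p)]

# (fairness score, semantic-correctness score) per candidate embedding
candidates = [(0.9, 0.4), (0.7, 0.7), (0.4, 0.9), (0.5, 0.5), (0.6, 0.6)]
print(pareto_front(candidates))   # [(0.9, 0.4), (0.7, 0.7), (0.4, 0.9)]
```

Instead of collapsing the two objectives into one weighted score, the search returns the whole non-dominated front, letting the engineer choose the fairness/correctness trade-off afterwards.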

More About The Speaker

16:00 Discussion/Breakout

17:00 Close


Dr Aymeric Blot, Université du Littoral Côte d'Opale
Prof Bill Langdon, University College London
Carol Hanna, University College London
Daniel Blackwell, University College London
Dr David Clark, University College London
Dr David Kelly, King's College London
Dr Derek Jones, Knowledge Software
Dr DongGyun Han, Royal Holloway, University of London
Dr Ezekiel Soremekun, Royal Holloway, University of London
Prof Federica Sarro, University College London
Fraser Garrow, Edinburgh Centre for Robotics
Prof George Magoulas, Birkbeck College, University of London
Dr Giovani Guizzo, University College London
Dr Gunel Jahangirova, King's College London
Dr Hector Menendez, King's College London 
Ilaria Pia la Torre, University College London
James Callan, University College London
Jeongju Sohn, University of Luxembourg
Dr Jie Zhang, King's College London
Dr Justyna Petke, University College London
Dr Kelly Androutsopoulos, Middlesex University
Prof Leon Moonen, Simula Research Laboratory
Dr Luca Traini, University of L'Aquila
Dr Maria Kechagia, University College London
Prof Mark Harman, Meta
Dr Markus Wagner, Monash University
Dr Max Hort, Simula Research Laboratory
Prof Myra Cohen, Iowa State University
Dr Oliver Krauss, University of Applied Sciences Upper Austria
Dr Sarah L. Thomson, University of Stirling
Prof Sarfraz Khurshid, University of Texas at Austin
Dr Sergey Mechtaev, University College London
Dr Serkan Kirbas, Bloomberg
Prof Tracy Hall, Lancaster University
Dr Vesna Nowack, Imperial College London
Dr Yue Jia, Meta

