Centre for Research on Evolution, Search and Testing (CREST)


The 65th CREST Open Workshop - Automated Program Repair and Genetic Improvement

18 September 2023–19 September 2023, 10:00 am–5:00 pm


Event Information




Sold out


Dr Justyna Petke, Dr Sergey Mechtaev, Dr Maria Kechagia, Nikhil Parasaram, CREST Centre, SSE Group, Department of Computer Science, UCL, UK


Program repair has the potential to reduce the significant manual effort that developers devote to finding and fixing software bugs. Recent years have witnessed dramatic growth in program repair research. Researchers have proposed a large number of techniques that aim to address fundamental challenges of program repair, such as scalability and test-overfitting, and have successfully deployed program repair in industry. One technique used in the repair field is genetic improvement (GI), which uses automated search to improve existing software. Aside from bug fixing, GI has been used to optimise other software properties, such as runtime, memory and energy consumption. It has also been used for other kinds of improvement, such as specialising and porting. The goal of this workshop is to reflect on the progress that the research community has made in recent years in these two closely related fields, share experience in research and deployment, and identify key challenges that need to be addressed in future research.

All talks at this workshop are by invitation only. Talks will be a maximum of 20 minutes long with plenty of time for questions and discussion. We also hope that the workshop will foster and promote collaboration, and there will be time set aside to support this.

Participants are expected to attend the whole event in person since the workshop is interactive and discursive. There is no registration fee, thanks to the kind support of UKRI EPSRC's fellowship grant on "Automated Software Specialisation Using Genetic Improvement". Light lunches will be included, along with the usual refreshments, all at no charge.

The registration of interest is closed.

Policy on Student Registrations

We welcome registrations from PhD students, where the student is pursuing a programme of research for which the COW will provide intellectual benefit and/or from whom the workshop and its other attendees will gain benefit. We do not normally expect to register students other than those on PhD-level programmes of study. For example, students taking a course at the equivalent of UK master's or bachelor's level would not, ordinarily, be considered eligible to register for COW. However, we are willing to consider exceptional cases, where a master's or bachelor's student has a clear contribution to make to the topic of the COW. In all cases, students must have their supervisor's/advisor's approval of their attendance at the COW and consent to the terms of registration. This is why we ask that students seeking to register for a COW also supply the contact details of their supervisor.

Cancellation Fee

Please appreciate that numbers are limited and catering needs to be booked in advance, so registration followed by non-attendance will cause difficulties. For this reason, though the workshop is entirely free of charge, there will be a cancellation fee of £100 for those who register but subsequently fail to attend.

Location: 66-72 Gower Street, Room G.01


Day 1 - 18th September 2023

10:00 Welcome and Introductions

10:30 Dr. Saemundur Haraldsson, University of Stirling

Will programming become an obsolete skill?

Every significant technological advance that increases productivity and/or efficiency in any industry has made at least a few people worry about their jobs. Our work on automatic software improvement, including APR and GI, is no exception.

In this talk I will discuss the evolution of software improvement technologies and their integration and acceptance in industry through my eyes as a software developer, an academic, and an educator.

Reflecting on a limited historical point of view, I will pose some questions of interest to academia, industry, and the education of the workforce, and speculate on what lies ahead with the recent rapid deployment of powerful generative-AI tools.

How might our work shape the workforce of the future?

Media: https://mediacentral.ucl.ac.uk/Player/gf219aHc



11:00 Nikhil Parasaram, University College London

Rete: Learning Namespace Representation for Program Repair

A key challenge of automated program repair is finding correct patches in the vast search space of candidate patches. Real-world programs define large namespaces of variables that considerably contribute to the search space explosion. Existing program repair approaches neglect information about the program namespace, which makes them inefficient and increases the chance of test-overfitting.

We propose Rete, a new program repair technique that learns project-independent information about the program namespace and uses it to navigate the search space of patches. Rete uses a neural network to extract project-independent information about variable CDU chains, that is, def-use chains augmented with control flow. Then, it ranks patches by jointly ranking variables and the patch templates into which the variables are inserted.

We evaluated Rete on 142 bugs extracted from two datasets, ManyBugs and BugsInPy. Our experiments demonstrate that Rete generates six new correct patches that fix bugs that previous tools did not repair, an improvement of 31% and 59% over the existing state of the art.

Media: https://mediacentral.ucl.ac.uk/Player/CDIBGEhG



11:30 Dr. Jie Zhang, King's College London

Quantifying the Threat to Empirical Software Engineering Validity from LLM Non-determinism: A Comprehensive Study of ChatGPT

There has been a recent explosion of research on Large Language Models (LLMs) for software engineering tasks, in particular code generation. However, results from LLMs can be highly unstable, nondeterministically returning very different code for the same prompt. Non-determinism is a potential menace to scientific conclusion validity. When non-determinism is high, scientific conclusions simply cannot be relied upon unless researchers change their behaviour to control for it in their empirical analyses. In this talk, I will introduce our empirical study, which demonstrates that non-determinism is, indeed, high, thereby underlining the need for this behavioural change. We choose to study ChatGPT because it is already highly prevalent in the code generation research literature. We report results from a study of 829 code generation problems from three code generation benchmarks (i.e., CodeContests, APPS, and HumanEval). Our results reveal high degrees of non-determinism: the ratio of problems with zero equal test output among code candidates is 72.73%, 60.40%, and 65.85% for CodeContests, APPS, and HumanEval, respectively. In addition, we find that setting the temperature to 0 does not guarantee determinism in code generation, although it does yield less non-determinism than the default configuration (temperature=1). These results confirm that there is, currently, a significant threat to scientific conclusion validity. In order to put LLM-based research on firmer scientific foundations, researchers need to take non-determinism into account when drawing their conclusions.
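As an illustration of the headline metric, the following minimal sketch computes the share of problems whose code candidates never agree on test output. It assumes one reading of "zero equal test output" (no two candidates for the same problem produce identical outputs across the test cases); the function name, data layout, and sample data are hypothetical and not taken from the study.

```python
from itertools import combinations


def zero_equal_output_ratio(outputs_by_problem):
    """Return the fraction of problems where no two candidates agree.

    outputs_by_problem maps a problem id to a list with one entry per
    generated candidate; each entry is a tuple of that candidate's
    outputs on the problem's test cases (hypothetical layout).
    """
    if not outputs_by_problem:
        raise ValueError("need at least one problem")
    disagreeing = 0
    for candidates in outputs_by_problem.values():
        # Count a problem only if it has at least two candidates and
        # every pair of candidates differs in its test outputs.
        if len(candidates) >= 2 and all(
            a != b for a, b in combinations(candidates, 2)
        ):
            disagreeing += 1
    return disagreeing / len(outputs_by_problem)


# Made-up outputs for three problems with three candidates each.
sample = {
    "p1": [("1", "2"), ("1", "3"), ("0", "2")],  # all candidates differ
    "p2": [("4",), ("4",), ("5",)],              # two candidates agree
    "p3": [("a",), ("b",), ("c",)],              # all candidates differ
}
ratio = zero_equal_output_ratio(sample)  # two of the three problems count
```

Comparing whole output tuples is the strictest form of agreement; a study could equally compare outputs test by test, which would give a different figure.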

Media: https://mediacentral.ucl.ac.uk/Player/11AI61di



12:00 Lunch

13:00 Dr. Serkan Kirbas, Bloomberg

Automatic Program Repair in Bloomberg

During this session, Serkan will share findings and observations from Automatic Program Repair (APR) work at Bloomberg. He will focus on the software engineers’ experience and practical aspects of getting automatically-generated code changes accepted and used in industry. Furthermore, he will discuss the results of qualitative research at Bloomberg, demonstrating the importance of the timing and the presentation of fixes.

Media: https://mediacentral.ucl.ac.uk/Player/e5dbAHeA



13:30 Dr. Vesna Nowack, Imperial College London

Expanding Fix Patterns to Enable Automatic Program Repair

Recent program repair tools have applied learned templates (fix patterns) to fix code using knowledge from fixes successfully applied in the past. However, there is still no general agreement on the representation of fix patterns, making their application and comparison with a baseline difficult. As a consequence, it is also difficult to expand fix patterns and further enable Automated Program Repair (APR).

In this presentation, I’ll show how to automatically generate fix patterns from similar fixes and compare the generated fix patterns against a state-of-the-art taxonomy. Our automated approach splits fixes into smaller, method-level chunks and calculates their similarity. A threshold-based clustering algorithm groups similar chunks and finds matches with state-of-the-art fix patterns. In our evaluation, we present 33 clusters whose fix patterns were generated from the fixes of 835 Defects4J bugs. Of those 33 clusters, 22 matched a state-of-the-art taxonomy with good agreement. The remaining 11 clusters were thematically analysed and yielded new fix patterns that expanded the taxonomy. Our new fix patterns should enable APR researchers to expand their tools to fix a greater range of bugs in the future.
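To make the idea of threshold-based clustering concrete, here is a minimal sketch that groups fix chunks by textual similarity. The similarity measure (Python's `difflib.SequenceMatcher`), the greedy assignment rule, the 0.8 threshold, and the sample chunks are all assumptions chosen for illustration; the abstract does not say which measure or algorithm the authors use.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Textual similarity in [0, 1]; a stand-in for the talk's measure."""
    return SequenceMatcher(None, a, b).ratio()


def cluster_chunks(chunks, threshold=0.8):
    """Greedy threshold-based clustering of fix chunks.

    A chunk joins the first cluster whose representative (its first
    member) is at least `threshold` similar; otherwise it starts a new
    cluster of its own.
    """
    clusters = []
    for chunk in chunks:
        for cluster in clusters:
            if similarity(chunk, cluster[0]) >= threshold:
                cluster.append(chunk)
                break
        else:
            # No existing cluster was similar enough.
            clusters.append([chunk])
    return clusters


# Hypothetical method-level fix chunks.
chunks = [
    "if (x != null) { x.close(); }",
    "if (y != null) { y.close(); }",
    "return a + b;",
]
clusters = cluster_chunks(chunks, threshold=0.8)
```

In this toy input the two null-check chunks differ only in a variable name and end up in one cluster, while the unrelated return statement forms its own.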

Media: https://mediacentral.ucl.ac.uk/Player/aEBg0i0I

14:00 Sebastian Schweikl, University of Passau

An Empirical Study of Automated Repair for Block-Based Learners' Programs

Specialised programming environments dedicated to programming education continue to gain traction, with Scratch among the most popular ones. Its block-based nature helps prevent syntax errors, but learners can still make semantic mistakes in countless ways. In light of large classrooms, an automated solution that uses custom repairs to provide hints is desirable. We introduce, to the best of our knowledge, the first prototype that enables such repairs for Scratch. Our approach is implemented as part of the Whisker testing framework for Scratch, and uses Genetic Improvement to evolve program variants guided by a test suite. In a preliminary study, we evaluate Whisker on a set of programs with 14.6 bugs on average, taken from a real-world classroom setting. The results show Whisker is able to fix up to 14 bugs per project, with 3.7 fixes on average. The peculiarities of Scratch present unique challenges that remain to be addressed, such as the high number of bugs per project, long test executions, flat fitness landscapes caused by very small test suites, and small projects limiting the scope from which fixes can be drawn during

Media: https://mediacentral.ucl.ac.uk/Player/0j14dabc



14:30 Refreshments

15:00 Dr. Gunel Jahangirova, King's College London

Repairing DNN Architecture: Are We There Yet?

As Deep Neural Networks (DNNs) are rapidly being adopted within large software systems, software developers are increasingly required to design, train, and deploy such models into the systems they develop. Consequently, testing and improving the robustness of these models have received a lot of attention lately. However, relatively little effort has been made to address the difficulties developers experience when designing and training such models: if the evaluation of a model shows poor performance after the initial training, what should the developer change? We survey and evaluate existing state-of-the-art techniques that can be used to repair model performance, using a benchmark of both real-world mistakes developers made while designing DNN models and artificial faulty models generated by mutating the model code. The empirical evaluation shows that a random baseline is comparable with, or sometimes outperforms, existing state-of-the-art techniques. However, for larger and more complicated models, all repair techniques fail to find fixes. Our findings call for further research to develop more sophisticated techniques for Deep Learning repair.

Media: https://mediacentral.ucl.ac.uk/Player/b72b75EI

15:30 Dr. David Clark, University College London

Causing the Repair of Hyperproperties

Hyperproperties are currently attracting interest from the programming languages research community, in terms of both formal, semantics-based analysis and testing and dynamic analysis. People I work with have been developing methods to automatically detect hyperproperty violations, then to measure the extent of interference using information theory, and finally to apply GI in order to repair the violations.

Media: https://mediacentral.ucl.ac.uk/Player/G6Jei12a



16:00 Discussion/Breakout

17:00 Close


Day 2 - 19th September 2023

10:00 Refreshments

10:30 Dr. Claire Le Goues, Carnegie Mellon University

Automatic repair of client code in light of evolving APIs

Modern software engineering revolves around the use of third-party libraries and APIs. Changing or upgrading libraries that a client project depends on is tedious and error-prone, to the point that many developers simply don't. In this talk, I will discuss our recent work on inferring and applying code transformations to automatically update client code in light of evolving libraries. These techniques rely on a careful interplay between powerful advances in ML/NLP models to manage the search space, and more traditional symbolic approaches to support transformation correctness and quality.

Media: https://mediacentral.ucl.ac.uk/Player/abh9GfiD

11:00 Prof. Tegawendé F. Bissyandé, University of Luxembourg

Automated Test-free Repair

Most program repair techniques rely on test cases as a key ingredient for driving patch generation and validation. Test cases have been successfully leveraged to automate the repair of many bug classes, but relying on them has hindered progress on repairing vulnerabilities. Their scarcity, exacerbated by the fact that, for vulnerabilities, test cases are also exploits, forms a “sound” barrier that we propose to break in our work. Instead of tests, we suggest leveraging other signals such as the ones that can be found in the vulnerability detection and fix suggestions output of static analysis security testing (SAST) to learn to repair vulnerable code.



11:30 Dr. Matias Martinez, Universitat Politècnica de Catalunya-Barcelona Tech

Energy Consumption of Automated Program Repair: From Search-based to Large Language Model-based Repair Tools

Automated program repair (APR) aims to automate the process of repairing software bugs in order to reduce the cost of maintaining software programs. Moreover, the success (given by the accuracy metric) of APR approaches has been increasing in recent years. However, no previous work has considered the energy impact of repairing bugs automatically using APR. The field of green software research aims to measure the energy consumption required to develop, maintain and use software products. This paper combines, for the first time, the APR and green software research fields. Our main goal is to define the foundation for measuring the energy consumption of the APR activity. We measure the energy consumption of ten traditional program repair tools for Java and ten Large Language Models (LLMs) fine-tuned on source code as they try to repair real bugs from Defects4J, a set of real buggy programs. The initial results from this experiment show the existing trade-off between energy consumption and the ability to correctly repair bugs: some APR tools are capable of achieving higher accuracy by spending less energy than other tools.

Media: https://mediacentral.ucl.ac.uk/Player/E7giiFd6



12:00 Lunch

13:00 Dr. Michele Tufano, Microsoft

InferFix: End-to-End Program Repair with LLMs

The software development life cycle is profoundly influenced by bugs: their introduction, identification, and eventual resolution account for a significant portion of software cost. This has motivated software engineering researchers and practitioners to propose different approaches for automating the identification and repair of software defects. Large language models have been adapted to the program repair task through few-shot demonstration learning and instruction prompting, treating this as an infilling task. However, these models have only focused on learning general bug-fixing patterns for uncategorized bugs mined from public repositories. In this paper, we propose InferFix: a transformer-based program repair framework paired with a state-of-the-art static analyzer to fix critical security and performance bugs. InferFix combines a Retriever, a transformer encoder model pretrained via a contrastive learning objective that aims to search for semantically equivalent bugs and corresponding fixes, and a Generator, a large language model (Codex Cushman) finetuned on supervised bug-fix data with prompts augmented via bug type annotations and semantically similar fixes retrieved from an external non-parametric memory. To train and evaluate our approach, we curated InferredBugs, a novel, metadata-rich dataset of bugs extracted by executing the Infer static analyzer on the change histories of thousands of Java and C# repositories. Our evaluation demonstrates that InferFix outperforms strong LLM baselines, with a top-1 accuracy of 65.6% for generating fixes in C# and 76.8% in Java. We discuss the deployment of InferFix alongside Infer at Microsoft, which offers an end-to-end solution for detection, classification, and localization of bugs, as well as fixing and validation of candidate patches, integrated in the continuous integration pipeline to automate the software development workflow.

Media: https://mediacentral.ucl.ac.uk/Player/fIh6IgFD



13:30 Prof. Leon Moonen, Simula Research Laboratory

SEIDR - Fully Autonomous Programming with Large Language Models

Current approaches to program synthesis with Large Language Models (LLMs) exhibit a "near miss syndrome": they tend to generate programs that semantically resemble the correct answer (as measured by text similarity metrics or human evaluation), but achieve a low or even zero accuracy as measured by unit tests due to small imperfections, such as the wrong input or output format. 

To address this challenge, we have recently proposed SEIDR (Synthesize, Execute, Instruct, Debug and Rank), an approach in which a draft solution is generated first, followed by an iterative program repair loop addressing the failed tests. To effectively apply this approach to instruction-driven LLMs, one needs to determine which prompts perform best as instructions for LLMs, and strike a balance between repairing unsuccessful programs and replacing them with newly generated ones. 

We explore these trade-offs empirically, comparing replace-focused, repair-focused, and hybrid debug strategies, as well as different template-based and model-based prompt-generation techniques. We use OpenAI Codex as the LLM and Program Synthesis Benchmark 2 as a database of problem descriptions and tests for evaluation. The resulting framework outperforms both conventional usage of Codex without the repair phase and traditional genetic programming approaches.
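The loop described above can be sketched roughly as follows. This is a toy illustration under stated assumptions, not the SEIDR implementation: programs are plain Python callables, `synthesize` and `debug` are hypothetical stand-ins for LLM calls, and the parameters `beam_width` and `children` are invented here to show how a search might balance keeping existing programs against generating new repairs.

```python
def seidr_sketch(synthesize, debug, tests, beam_width=2, children=2, max_rounds=5):
    """Toy synthesize-execute-instruct-debug-rank loop.

    synthesize() returns at least one draft program (a callable here).
    debug(program, instruction) returns a candidate repair.
    tests is a list of (input, expected_output) pairs.
    """
    def execute(program):
        # Execute: run every test and collect the failures.
        failures = []
        for arg, expected in tests:
            try:
                actual = program(arg)
            except Exception as exc:  # a crash counts as a failed test
                actual = repr(exc)
            if actual != expected:
                failures.append((arg, expected, actual))
        return failures

    beam = list(synthesize())
    for _round in range(max_rounds):
        # Rank: fewest failing tests first, keep only the best few.
        ranked = sorted(
            ((execute(p), p) for p in beam), key=lambda pair: len(pair[0])
        )[:beam_width]
        if not ranked[0][0]:
            return ranked[0][1]  # all tests pass
        beam = [program for _failures, program in ranked]
        for failures, program in ranked:
            # Instruct: turn the first failing test into a repair prompt.
            arg, expected, actual = failures[0]
            instruction = f"input {arg!r}: expected {expected!r}, got {actual!r}"
            # Debug: request several candidate repairs per kept program.
            beam.extend(debug(program, instruction) for _ in range(children))
    # Budget exhausted: return the best program found so far.
    return min(beam, key=lambda p: len(execute(p)))


# Hypothetical stand-ins for the LLM: a flawed draft and canned "repairs".
_repairs = iter([lambda x: x + 2, lambda x: x * 2])


def toy_synthesize():
    return [lambda x: x + 1]


def toy_debug(program, instruction):
    return next(_repairs, program)


solution = seidr_sketch(toy_synthesize, toy_debug, tests=[(1, 2), (3, 6)])
```

Keeping the ranked parents in the beam alongside their repairs is a choice made for this sketch so that the best score never regresses between rounds; other replacement policies are equally possible.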

Media: https://mediacentral.ucl.ac.uk/Player/c8bg6h48



14:00 Prof. Lin Tan, Purdue University

Customized Models or Generic Code Language Models (for automated program repair and more)

Deep learning techniques have unique advantages in automating challenging software development tasks. I will share our initial investigation to understand the tradeoffs of building customized versus generic deep-learning models for tasks including fixing software bugs automatically. First, existing models do not have software domain knowledge such as code semantics or syntax. Our code-aware deep learning techniques add such domain knowledge to models to fix software bugs more effectively. For example, our domain-rule distillation technique leverages syntactic and semantic rules and teacher-student distributions to explicitly inject the domain knowledge into the decoding procedure during both the training and inference phases. This approach outperforms existing program-repair techniques on three widely-used benchmarks. Second, I will discuss our comparison of customized models versus generic code language models for the same task. Some surprising results include that the best (generic) code language model, as is, fixes 72% more bugs than the state-of-the-art deep-learning-based program repair techniques.


14:30 Refreshments

15:00 Prof. Mark Harman, Meta/University College London

Large Language Models for Software Engineering: Survey and Open Problems

This talk reviews the forthcoming survey of the emerging area of Large Language Models (LLMs) for Software Engineering (SE), which will appear in the proceedings of the ICSE 2023 Future of Software Engineering track.

It sets out open research challenges for the application of LLMs to technical problems faced by software engineers.

LLMs' emergent properties bring novelty and creativity with applications right across the spectrum of Software Engineering activities including coding, design, requirements, repair, refactoring, performance improvement, documentation and analytics.

However, these very same emergent properties also pose significant technical challenges; we need techniques that can reliably weed out incorrect solutions, such as hallucinations.

Our survey reveals the pivotal role hybrid techniques (traditional SE plus LLMs) have to play in the development and deployment of reliable, efficient and effective LLM-based SE.

The talk will focus on the way in which LLM-based code generation naturally fits within an overall genetic improvement framework.

Media: https://mediacentral.ucl.ac.uk/Player/49IGB22H



15:30 Dr. Lingming Zhang, University of Illinois Urbana-Champaign

Automated Program Repair in the Era of Large Language Models

Automated Program Repair (APR) aims to help developers automatically patch software systems. Existing traditional and learning-based APR techniques typically rely on high-quality bug-fixing datasets to craft repair templates or directly predict potential patches based on Neural Machine Translation (NMT). However, such bug-fixing datasets can be extremely hard to construct, are limited in size, and may also contain various irrelevant/noisy commits or changes. As a result, it is hard for existing APR techniques to fix complicated bugs that are unseen in, or hard to generalize from, such bug-fixing datasets.

In this talk, I will present AlphaRepair, the first approach to reformulating the APR problem as a cloze (or infilling) task based on the recent Large Language Models (LLMs) trained on billions of text/code tokens. Our main insight is that instead of modeling what a repair edit should look like (i.e., an NMT task), we can directly use LLMs to predict what the correct code is based on the surrounding contexts of the buggy code (i.e., a cloze task). Such cloze-style APR can completely free APR from historical bug fixes and leverage the massive pre-training corpora of LLMs for multi-lingual APR. Our AlphaRepair study also demonstrates that LLMs can outperform existing APR techniques studied for over a decade. Lastly, I will also briefly discuss our other work on LLM-based APR, including ChatRepair, a conversational APR approach based on the very recent ChatGPT model.
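The cloze formulation can be illustrated with a small sketch: mask the suspicious line, ask an infilling model for replacements given the surrounding context, and keep the first candidate that passes the tests. Everything named here is hypothetical. In particular, `toy_infill` returns canned suggestions in place of a real language model, and the mask token, helper names, and buggy function are invented for the example rather than taken from AlphaRepair.

```python
MASK = "<mask>"


def make_cloze(source_lines, buggy_index):
    """Replace the buggy line with a mask token, keeping its indentation."""
    masked = list(source_lines)
    line = masked[buggy_index]
    indent = line[: len(line) - len(line.lstrip())]
    masked[buggy_index] = indent + MASK
    return "\n".join(masked)


def cloze_repair(source_lines, buggy_index, infill, passes_tests):
    """Try each infilled candidate and return the first one that passes."""
    prompt = make_cloze(source_lines, buggy_index)
    for candidate in infill(prompt):
        patched = prompt.replace(MASK, candidate, 1)
        if passes_tests(patched):
            return patched
    return None


# A buggy function: the condition should test for negative numbers.
buggy = [
    "def absolute(x):",
    "    if x > 0:",
    "        return -x",
    "    return x",
]


def toy_infill(prompt):
    # Stand-in for an LLM that proposes code for the masked position.
    return ["if x >= 0:", "if x < 0:"]


def toy_passes_tests(patched_source):
    namespace = {}
    try:
        exec(patched_source, namespace)
        absolute = namespace["absolute"]
        return absolute(-3) == 3 and absolute(2) == 2
    except Exception:
        return False


patched = cloze_repair(buggy, 1, toy_infill, toy_passes_tests)
```

Note that no historical bug-fix pair appears anywhere in the sketch: the only inputs are the surrounding code and a test-based check, which is the point of the cloze formulation.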

Media: https://mediacentral.ucl.ac.uk/Player/c3e1IdFD



16:00 Discussion/Breakout

17:00 Close


Dr Alexandros Tasos, Bayforest Technologies
Prof Bill Langdon, University College London
Dr Sandy Brownlee, University of Stirling
Dr Ezekiel Soremekun, Royal Holloway, University of London
Prof Leon Moonen, Simula Research Laboratory
Dr Saemundur Haraldsson, University of Stirling
Dr David Kelly, King’s College London
Dr Zhenpeng Chen, University College London
Daniel Blackwell, University College London
Dr Derek Jones, Knowledge Software
Dr Mike Papadakis, University of Luxembourg
Elisa Braconaro, Università degli Studi di Padova
Dr Eleonora Losiouk, Università degli Studi di Padova
Carol Hanna, University College London
Ilaria Pia la Torre, University College London 
Prof Darrell Whitley, Colorado State University
Prof Gabriela Ochoa, University of Stirling
Dr DongGyun Han, Royal Holloway, University of London
Dr Jie Zhang, King's College London
Dr Justyna Petke, University College London
Prof Federica Sarro, University College London
Dr Maria Kechagia, University College London
Prof Mark Harman, Meta and University College London
Dr Michele Tufano, Microsoft
Dr Sergey Mechtaev, University College London
Dr Serkan Kirbas, Bloomberg
Dr Vesna Nowack, Imperial College London
Sebastian Schweikl, University of Passau
Dr Gunel Jahangirova, King's College London
Dr David Clark, University College London
Dr Claire Le Goues, Carnegie Mellon University
Prof Tegawendé F. Bissyandé, University of Luxembourg
Dr Matias Martinez, Universitat Politècnica de Catalunya-Barcelona Tech
Prof Lin Tan, Purdue University
Nikhil Parasaram, University College London
Dr Lingming Zhang, University of Illinois Urbana-Champaign
Andre Silva, KTH
Han Fu, KTH and Ericsson
Shuyin Ouyang, King's College London
Yonghao Wu, King's College London