Advanced Research Computing


Taming the data pipeline

Faced with an increasingly time-consuming and unreliable workflow, researchers from the Bartlett School of Environment, Energy & Resources turned to the Research Software Development Group for help.


6 February 2020


This project, led by Dr Paul Dodds from the Bartlett School of Environment, Energy and Resources, aims to assess the value of UK interconnectors to the EU-27 countries and Norway, examining both the GB and Single Irish electricity markets. The analysis involves a data pipeline that aggregates data on different sectors and energy products for each of the relevant countries, and produces visualisations to present the results to different audiences.

A patchwork problem

The Research Software Development Group (RSDG) were brought on board to help improve the existing data analysis pipeline which, owing to a long history of development by different researchers over several years, had become overcomplicated and difficult to understand. The three main issues with the pipeline were:

  • A large part of the analysis was carried out using an Excel template which had grown in complexity to the extent that it was difficult to keep track of calculations and identify potential errors.
  • Inconsistent formatting of the input data meant that pre-processing had to be done manually in order to filter, aggregate, and arrange the data into the correct format for the Excel template.
  • The analysis also involved the use of a proprietary, closed-source software product which could not be validated.

Unfortunately, RSDG were not able to tackle all of these problems in the time available, so they focused on the problem of pre-processing the input data.

Off to a good start

Most of the raw data comes from the International Energy Agency (IEA) and various European projects, which provide data on energy products and their use across different sectors and countries. The format of the data coming from those different sources is varied, and has also changed over time, making it difficult to automate the process of getting the data into the Excel templates.

RSDG developed a tool in Python to automate the upload process, using the Pandas library to read the input files into dataframes where the data could be manipulated before being imported into Excel. This work also involved writing a set of dictionaries and user-friendly configuration files describing how to interpret input files from different times and sources, and how to convert the data for different use cases. Once these are in place, the user simply enters the desired input and output formats in a master configuration file and runs the program to pre-process the data.
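To illustrate the approach, here is a minimal sketch of what such a config-driven pre-processing step might look like. The configuration entries, column names, and function below are illustrative assumptions, not the actual RSDG code: a per-source dictionary describes how to rename and filter the raw columns, and Pandas then reshapes the data into a tidy layout ready for the Excel template.

```python
import pandas as pd

# Hypothetical configuration: one entry per data source/vintage,
# describing how to map its raw columns onto a common layout.
SOURCE_CONFIG = {
    "example_source_2018": {
        "rename": {"COUNTRY": "country", "PRODUCT": "product", "VALUE": "value_twh"},
        "keep": ["country", "product", "value_twh"],
    },
}

def preprocess(raw: pd.DataFrame, source: str) -> pd.DataFrame:
    """Rename, filter, and aggregate one raw input table into a common format."""
    cfg = SOURCE_CONFIG[source]
    df = raw.rename(columns=cfg["rename"])[cfg["keep"]]
    # Aggregate duplicate rows (e.g. sub-sectors) per country and product.
    return df.groupby(["country", "product"], as_index=False)["value_twh"].sum()

# Example raw input in one (invented) source format.
raw = pd.DataFrame({
    "COUNTRY": ["Norway", "Norway", "France"],
    "PRODUCT": ["Electricity", "Electricity", "Electricity"],
    "VALUE": [10.0, 5.0, 7.0],
    "NOTES": ["-", "-", "-"],
})
tidy = preprocess(raw, "example_source_2018")
# The tidy table could then be written out for the Excel template,
# e.g. with tidy.to_excel("output.xlsx", index=False).
```

The key design point is that knowledge about each source's quirks lives in data (the configuration), not in code, so supporting a new file format means adding a dictionary entry rather than editing the program.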

Next steps

The work done so far by RSDG is a crucial first step in simplifying and rationalising the research pipeline, and should save the researchers a great deal of time and effort. Automation is not just about saving time, however: this part of the analysis pipeline is now fully documented, tested, and version-controlled, reducing the potential for errors and improving the reproducibility of the analysis.

RSDG also set up a private GitHub repository for the code, where the research group can manage access and collaborate on future development. By introducing best practices such as version control and unit testing, RSDG have created a solid foundation for tackling the other issues identified above.