Data Stewards

A service for collaborating with ARC Data Stewards, who can support your projects with expertise in data quality and data management.

Key Features

Accessible to UCL researchers
Support to improve data management
Support to improve data quality
Emphasis on making data outputs FAIR
Utilisation of local expertise in data production, management and sharing
Reduced risks regarding data security and privacy requirements
Network with local academic professionals for collaboration on future projects

Overview

Researchers in many fields are creating and analysing complex datasets in ways that enable breakthroughs that would not previously have been possible. However, sophisticated data processing and analysis can present new challenges for research reproducibility and data re-use. The skills and systems needed to collect, manage, analyse, and document research data are becoming increasingly specialised, and not all researchers or research groups have the expertise to apply the latest best practices to their work. Lack of awareness of good data management practices often leads to underutilisation of data, bottlenecks, and even data loss. A research project without an active data management plan, or without the processes in place to enable it, can often end up using ad-hoc storage solutions and sub-optimal data pipelines. Requirements eventually exceed the capacity of such hastily built solutions.

As part of the Research Data Group, the Data Stewards are well placed to integrate UCL's data services with the software systems built for academic research. Our Data Stewards provide consultancy on data-related problems at various levels, from data engineering and wrangling to designing distributed data systems. We help identify ways in which data management could be more efficient, and support researchers in maximising the value of their data.

Data Stewards support researchers with their research data management, including advocating for open and FAIR data, by acting as the first point of contact for enquiries, as a link to central services (e.g. the Library, IT, Information Governance) and as experts in data practices in the disciplines we support. Furthermore, because our Data Stewards are employed on permanent contracts, we can alleviate the problems many research groups face when they try to recruit people with the appropriate skills and knowledge on short-term temporary contracts. Our Data Stewards have the opportunity to develop their professional experience and apply it across multiple projects at UCL, keeping useful skills in-house.

Time spent developing new project proposals is not charged for, although significant time working on a research project should be costed into that grant.

Data Steward Case Study: RRED

GitHub Opportunity

The Reading Recovery Evaluation Database (RRED) is an ongoing cohort study based in the International Literacy Centre at the UCL Institute of Education. Each year, data is added manually to the database by reading recovery teachers from a range of schools, providing an invaluable resource for studying the impact of reading recovery programmes. Over 600 teachers contribute to the database each year.

On 22 August 2022, the ARC Research Data team were approached for Data Stewardship support. Because the project data is special category data, it required careful and secure handling and would be hosted in the Data Safe Haven service. At this point the funding had yet to be confirmed, but this gave the team time to examine the project and assess its support needs.

Project Aims

Initially, the project had three aims:

To review (and potentially redesign) the current data management pipeline and processes
To implement an improved data management process, improving reproducibility of processing to avoid manual errors
To provide ongoing support and maintenance

In later meetings the priorities were clarified further. It was determined that priority should be given to two things: the method by which data was exported, pre-processed and filtered from a REDCap database, and the way it was presented in a personalised report that is generated automatically and emailed back to schools, with user-acceptance testing stages and data verification along the way. It is in the nature of projects for the scope to be refined later in the project lifespan, and our stewards are flexible enough to adapt to such changes and continue supporting the project efficiently.

Group Project

Once funding was confirmed on 22 December 2022, the project was officially underway. The Data Steward who volunteered to take on the project worked with a Research Software Engineer towards the project aims. Work was divided according to each project member's skillset, assigned and planned on a GitHub project board, and tracked through GitHub, with a GitHub wiki used for project management documentation.

Project Work

Given the nature of the project, the team members came together to determine exactly which steps had to be taken, and in which order.
- First, the data had to be extracted and processed.
  - This required converting the pre-existing R scripts for exporting and reporting into Python, to make the code accessible to a broader range of developers. As such, it stood a better chance of being understood, expanded upon, and maintained.
  - The team also needed to add automated tests and set up continuous integration, so that every change is checked against the tests and coding conventions.
- Next, the extracted data had to undergo data quality checks. This would ensure that any errors in the data from previous and current reports were caught as quickly as possible, so that they could be examined and rectified.
- Finally, the code had to fulfil its purpose of generating and sending a report.
  - The reports had to be populated using a template provided by the research group, which is filled with the relevant data for each school and then converted to PDF.
  - Any errors that could not be resolved had to be summarised and reported.
  - Once the reports had been signed off, the code had to automate their delivery to the email addresses of representatives from each school.
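
The quality-check and report-population steps above can be illustrated with a minimal Python sketch. It is not the project's actual code: the record fields, the valid score range, the template wording and the names `PupilRecord`, `check_record` and `build_report` are all assumptions made for illustration, and the PDF conversion step is left out.

```python
from dataclasses import dataclass
from string import Template
from typing import List, Optional, Tuple

# Hypothetical valid range for a level score; not taken from RRED.
MIN_LEVEL, MAX_LEVEL = 0, 30


@dataclass
class PupilRecord:
    # Illustrative fields only; the real RRED schema is not described here.
    pupil_id: str
    school_id: str
    entry_level: Optional[int]
    exit_level: Optional[int]


def check_record(record: PupilRecord) -> List[str]:
    """Return human-readable descriptions of problems in one record."""
    problems = []
    for name in ("entry_level", "exit_level"):
        value = getattr(record, name)
        if value is None:
            problems.append(f"{record.pupil_id}: {name} is missing")
        elif not MIN_LEVEL <= value <= MAX_LEVEL:
            problems.append(f"{record.pupil_id}: {name}={value} is out of range")
    return problems


# Stand-in for the template supplied by the research group.
REPORT_TEMPLATE = Template(
    "Report for school $school_id\n"
    "Pupils included: $n_pupils\n"
    "Problems found: $n_problems\n"
)


def build_report(
    school_id: str, records: List[PupilRecord]
) -> Tuple[str, List[str]]:
    """Fill the template for one school and collect its data problems."""
    own = [r for r in records if r.school_id == school_id]
    problems = [p for r in own for p in check_record(r)]
    text = REPORT_TEMPLATE.substitute(
        school_id=school_id, n_pupils=len(own), n_problems=len(problems)
    )
    return text, problems


if __name__ == "__main__":
    demo = [
        PupilRecord("P1", "S001", 1, 5),
        PupilRecord("P2", "S001", None, 99),  # deliberately faulty record
        PupilRecord("P3", "S002", 2, 4),
    ]
    report_text, report_problems = build_report("S001", demo)
    print(report_text)
    print(report_problems)
```

Returning the list of problems alongside the report text keeps the two concerns separate: the report can still be produced, while the problems can be summarised and passed back for examination.
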
Specifically, the Data Steward involved in the project was responsible for:
- Designing nested data structures (school, teacher, pupil) and converting BO files to the new structure
- Creating a set of dummy data to work on, so that the code could be run on a smaller, fake dataset that could also be published in the GitHub repository. This also allowed more of the work to be done outside the Data Safe Haven environment
- Using the dummy data to stand in for the RRED dataset, implementing a filter that selects the pupils and staff from one school and presents a report
- Running the filter and test code on the actual RRED dataset, examining the different tables relating to teachers, dates and specific outcomes
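
As a rough illustration of the nested structure, the dummy data and the per-school filter, here is a minimal Python sketch. The class names, identifier formats, outcome labels and the `make_dummy_schools` and `summarise_school` helpers are hypothetical; the real RRED structures are not described in this case study.

```python
import random
from collections import Counter
from dataclasses import dataclass, field
from typing import Dict, List

# Placeholder outcome labels, not the categories used in RRED.
OUTCOMES = ["outcome_a", "outcome_b", "outcome_c"]


@dataclass
class Pupil:
    pupil_id: str
    outcome: str


@dataclass
class Teacher:
    teacher_id: str
    pupils: List[Pupil] = field(default_factory=list)


@dataclass
class School:
    school_id: str
    teachers: List[Teacher] = field(default_factory=list)


def make_dummy_schools(
    n_schools: int = 3,
    teachers_per_school: int = 2,
    pupils_per_teacher: int = 4,
    seed: int = 0,
) -> List[School]:
    """Generate small, entirely fake data with a fixed seed for repeatability."""
    rng = random.Random(seed)
    schools = []
    for s in range(n_schools):
        school = School(school_id=f"S{s:03d}")
        for t in range(teachers_per_school):
            teacher = Teacher(teacher_id=f"{school.school_id}-T{t}")
            for p in range(pupils_per_teacher):
                teacher.pupils.append(
                    Pupil(
                        pupil_id=f"{teacher.teacher_id}-P{p}",
                        outcome=rng.choice(OUTCOMES),
                    )
                )
            school.teachers.append(teacher)
        schools.append(school)
    return schools


def summarise_school(schools: List[School], school_id: str) -> Dict[str, object]:
    """Filter to one school and summarise its teachers and pupils."""
    for school in schools:
        if school.school_id == school_id:
            pupils = [p for t in school.teachers for p in t.pupils]
            return {
                "school_id": school.school_id,
                "n_teachers": len(school.teachers),
                "n_pupils": len(pupils),
                "outcomes": dict(Counter(p.outcome for p in pupils)),
            }
    raise KeyError(f"unknown school: {school_id}")


if __name__ == "__main__":
    print(summarise_school(make_dummy_schools(), "S001"))
```

Because the generator is seeded, the same fake dataset is produced on every run, which makes it suitable for automated tests and for publishing alongside the code.
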

Deliverable

At the end of the project, the team delivered software that generates a report for each school from the data for a specific survey period. The code also sends the reports by email to representatives at each school, so that they can archive and share the findings.
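
The email delivery step could look something like the following sketch, which uses only the Python standard library. The function name `build_report_email`, the addresses, the subject line and the file naming are assumptions for illustration, not details of the delivered software. The message is only constructed here; actually sending it would need an SMTP server, for example via `smtplib.SMTP(...).send_message(msg)`.

```python
from email.message import EmailMessage


def build_report_email(
    sender: str, recipient: str, school_id: str, pdf_bytes: bytes
) -> EmailMessage:
    """Build (but do not send) an email with one school's PDF report attached."""
    msg = EmailMessage()
    msg["From"] = sender
    msg["To"] = recipient
    msg["Subject"] = f"Report for school {school_id}"
    msg.set_content("Please find the report for this survey period attached.")
    # Attach the already-generated PDF; producing the PDF is out of scope here.
    msg.add_attachment(
        pdf_bytes,
        maintype="application",
        subtype="pdf",
        filename=f"{school_id}_report.pdf",
    )
    return msg


if __name__ == "__main__":
    # Placeholder bytes stand in for a real PDF file.
    message = build_report_email(
        "team@example.org", "rep@example.org", "S001", b"%PDF-1.4 placeholder"
    )
    print(message["Subject"])
    print([part.get_filename() for part in message.iter_attachments()])
```

Keeping message construction separate from sending means the content of each email can be checked, or signed off, before anything leaves the system.
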

With the process now automated as far as possible, future work can focus on resolving any errors that arise.

If you would like to speak to us about the Data Stewards and the work we do, please write to us at researchdata-support@ucl.ac.uk