Research IT Services


Opinion: The Role of the Data Repository in Supporting Research

19 December 2018

James A J Wilson, Head of Research Data Services at RITS, shares his thoughts on how UCL's upcoming repository for research data will enable Open Access principles to be followed as part of a wider push to support reproducible research.


In early 2019, UCL will be launching its own repository for research data. Until now, researchers working in fields that do not have their own dedicated repositories have had to either deposit their data with externally provided services that offer few guarantees of longevity, or endure a painfully manual process of wrangling their data into systems that were never really intended to act as data repositories. The UCL Research Data Repository will provide staff and research students with a straightforward system whereby they can describe and deposit data, enable others to discover that data and, where permissible, to view and access it. Uploaded datasets will be issued with a Digital Object Identifier (DOI), enabling them to be accurately cited in publications and due credit to be given to their creators.

The underlying storage and interfaces of the UCL Research Data Repository will be provided by Figshare, an established repository provider that already offers a popular commercial service. The Repository is, however, a long-term commitment, and is designed so that the data and associated records can be migrated between solution providers, ensuring that the university is not locked into a particular system forevermore. Indeed, one of the principles underpinning the repository is openness, and UCL has been careful to follow the principles of Open Access and the more recent concept that data should be ‘FAIR’ (Findable, Accessible, Interoperable, and Reusable). Findability and accessibility are core to the repository, but the Repository’s role in facilitating interoperability and reusability is also worth noting.

The recently updated UCL Research Data Policy indicates that data should be “as open as possible, as closed as necessary”,[1] and the Repository is built to enable this. When researchers upload data to the repository they will be asked to apply a licence to it. This can be any licence, including a custom one, but the use of an open licence is strongly encouraged, with the Creative Commons CC0 licence as the default.[2] CC0 is the most open of the commonly applied licences and allows others to reuse data with the greatest freedom. One reason for this choice is that it helps with the interoperability of data and records: more restrictive licences can unwittingly restrict the machine-readability and re-use of data, and, with people in a number of disciplines increasingly combining and analysing large collections of data, CC0 does not impair one’s capability to conduct data-driven research.

Researchers concerned about publishing their data so openly may be reassured that the Repository enables data to be embargoed for a length of time – enabling citation of the data underpinning their publications whilst not allowing the data itself to be downloaded by others who could potentially beat the originators to further publications. General access to the data is only enabled once the data creators have had the chance to exploit it themselves.

Data that needs to remain confidential, perhaps because of agreements with industrial funders, can remain so. A record for the dataset informs others of its existence, but access to the dataset itself can be appropriately restricted.

Interoperability is not just of interest to those engaged with data-driven research. It is also central to the development of IT systems, enabling the automation of processes that can ultimately support researchers by reducing the burden of documentation they face when producing reproducible research.

The so-called reproducibility crisis has become a significant concern across a number of disciplines over the past ten years or so, and several studies have indicated that one of the surest ways to hinder reproducibility is to deny access to the data underlying research. But simply making the data available only goes so far towards improving reproducibility. To really improve matters, data needs to be documented, and one way in which this can be done without significant additional effort on the part of the researcher is to automate the capture of provenance metadata.

The Research Data Service team (part of Research IT Services in ISD) is working towards a longer-term goal of capturing an audit trail of the processes that data has been through. If it were possible to trace the path a dataset has taken since its creation, and pass this information through to its record in the Data Repository, this could help address some of the complexities of reproducibility. In this respect, the Research Data Repository forms one element of a more ambitious infrastructure which UCL is beginning to develop.

Imagine, for instance, a research workflow in which data is gathered or generated by a piece of equipment, stored in the Research Data Storage Service (where project information can be added), processed by software code using one of UCL’s high-performance compute facilities, with the resulting data then underpinning a journal article. At present it requires effort to describe, accurately and unambiguously, the process by which the data was created in the research paper to a degree that assists reproducibility. Suppose, on the other hand, that the settings of the equipment could be captured and recorded alongside the raw data; that the environment in which the data was processed could be containerized and added to the metadata; that the version of the software code used to process the data could be attached or referenced; and that the journal article could be linked to the underlying data, along with all the other provenance details. If all this were made open, it would in many cases significantly improve reproducibility. A good data repository is where all this information can be held and made available more widely.
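To make the idea concrete, the kind of provenance trail described above could be expressed as a simple machine-readable record. The sketch below is purely illustrative: the field names, identifiers, and values are hypothetical and do not correspond to any real UCL, Figshare, or DataCite schema.

```python
import json

# Hypothetical provenance record for the workflow described above.
# Every identifier here is a placeholder, not a real DOI or repository.
provenance = {
    "dataset_doi": "10.0000/example-dataset",  # DOI issued on deposit (placeholder)
    "raw_data": {
        "instrument": "example-spectrometer",  # the equipment that generated the data
        "settings": {"exposure_ms": 200, "gain": 1.5},  # captured instrument settings
    },
    "processing": {
        "container_image": "example/analysis:1.2",  # containerized processing environment
        "code_repository": "https://example.org/analysis-code",  # versioned analysis code
        "code_version": "v1.2.0",  # exact version used to process the data
    },
    "outputs": {
        "article_doi": "10.0000/example-article",  # the publication the data underpins
    },
}

# Serialising the record as JSON keeps it machine-readable, which is what
# makes automated capture, exchange between systems, and later reuse practical.
record_json = json.dumps(provenance, indent=2)
print(record_json)
```

If each service in the workflow contributed its own fragment of such a record automatically, the completed trail could travel with the dataset into its repository record without extra effort from the researcher.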

It will not be easy to build an infrastructure that can automate the gathering of metadata from all steps of a research workflow, many of which require domain expertise or processes that are outside the control of a centrally organized IT infrastructure. By integrating research IT tools and services where we can, we are at least edging towards improved reproducibility.

The UCL Research Data Repository is, therefore, not merely a service that can help researchers meet funder requirements for data sharing. It forms a key part of a larger infrastructure, currently in development, that can support the movement of research data through its lifecycle, accumulating rich metadata that can help people understand, reproduce, and re-use that data in future.

It will likely be several years before the potential benefits of such an infrastructure start to become apparent, but the launch of the Research Data Repository will mark a significant milestone on the journey.

James A J Wilson,
Head of Research Data Services