Read our staff contributions to see the range of research areas, from stochastic analysis and the theory of stochastic processes to more applied topics including the analysis and modelling of data.
The Need for Standardized Datasets for Financial Machine Learning
Dr Camilo Garcia Trillos

One of the reasons behind this slower pace is that, despite the impressive results often reported in research papers on applications such as portfolio selection or hedging, independent verification remains a challenge. Reproducibility suffers from unclear implementation details, and judgments about which historical data are relevant are subjective. Moreover, in finance we must contend with the inherent limitation of a single observed history ("one path") per asset, which makes it difficult to assess the robustness of models trained on such limited data. These characteristics create uncertainty about the quality of any given approach, for practitioners and regulators alike, and make it difficult to distinguish promising approaches from less effective ones.
This calls for a standard repository of synthetic financial data, carefully designed to mirror real-world complexities. Imagine high-quality synthetic datasets encompassing not only price histories but also additional factors such as economic indicators or even sentiment. Such a repository could facilitate robust model development and validation, enabling researchers and practitioners alike to compare methodologies on a common footing. I believe its role would be akin to that of MNIST or CIFAR in image classification: an essential reference for evaluating the performance of any given approach, making it easier for practitioners to recognise promising approaches and for regulators to understand the quality of models that might otherwise be opaque. A standardized dataset of this kind has the potential to accelerate development and simplify comparisons between models, and so to contribute to more robust and reliable applications in practice.
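To make the idea concrete, here is a minimal sketch of what a reproducible synthetic price dataset could look like. It is only an illustration, not the design our group is working on: the geometric Brownian motion model, the function name `make_synthetic_prices` and all parameter values are assumptions chosen for simplicity. The point is that a fixed seed lets anyone regenerate exactly the same data, and that many paths per asset become available rather than a single history.

```python
import math
import random


def make_synthetic_prices(n_paths=100, n_steps=252, s0=100.0,
                          mu=0.05, sigma=0.2, seed=0):
    """Simulate price paths under geometric Brownian motion.

    Illustrative assumption only: a realistic benchmark would need far
    richer dynamics than this toy model.
    """
    rng = random.Random(seed)              # fixed seed makes the data reproducible
    dt = 1.0 / n_steps                     # one year split into n_steps periods
    drift = (mu - 0.5 * sigma ** 2) * dt   # drift of the log-price per step
    vol = sigma * math.sqrt(dt)            # volatility of the log-price per step

    paths = []
    for _ in range(n_paths):
        price = s0
        path = [price]
        for _ in range(n_steps):
            # exact log-normal update of the price over one step
            price *= math.exp(drift + vol * rng.gauss(0.0, 1.0))
            path.append(price)
        paths.append(path)
    return paths


# Two users with the same seed obtain the same dataset.
dataset_a = make_synthetic_prices(seed=42)
dataset_b = make_synthetic_prices(seed=42)
print(dataset_a == dataset_b)
```

With data of this kind, a model can be evaluated across all simulated paths, something a single historical record cannot offer.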
This is why some members of our group and I have started the process of designing such a dataset. The proposal certainly faces several challenges: theoretical ones, such as deciding which properties are desirable in such a dataset, and practical ones, related to size and cost, that will arise at the implementation stage. However, we believe that this initiative will significantly benefit the entire quantitative finance community by fostering transparency, reproducibility and, ultimately, more robust and impactful financial applications.
We welcome your ideas and support in making this vision a reality. Do not hesitate to contact us to ask questions, provide feedback, or propose ways to support the initiative.