Mathematics

Meet our staff

Read our staff contributions to see the range of research areas, from stochastic analysis and the theory of stochastic processes to more applied topics including the analysis and modelling of data.

The Need for Standardized Datasets for Financial Machine Learning

Dr Camilo Garcia Trillos

Machine learning is revolutionizing many fields, including mathematical finance and quantitative analysis. Recent research literature in the area presents a variety of machine learning techniques developed to support portfolio investment, market hedging, risk measurement, scenario generation, fraud detection, model calibration, and more[^1]. Many of you are surely well aware of production implementations of some of these ideas. With some notable exceptions, however, the pace of adoption in practice has been slower than the pace of research.

One of the reasons behind this slower pace is that, despite the impressive results often reported in research papers on applications like portfolio selection or hedging, independent verification remains a challenge. Reproducibility issues stemming from unclear implementation details, together with the subjectivity surrounding the relevance of historical data, further complicate matters. Moreover, as we all know, in finance we grapple with the inherent limitation of "one path history" per asset, which makes it difficult to assess the robustness of models trained on limited datasets. These characteristics create uncertainty about the quality of a given approach, both for practitioners and for regulators. They also make it difficult to distinguish promising approaches from less effective ones.

This calls for the creation of a standard repository of synthetic financial data, meticulously crafted to mirror real-world complexities. Imagine high-quality synthetic datasets encompassing not only price histories but also additional factors like economic indicators or even sentiment. Such a repository could facilitate robust model development and validation, enabling researchers and practitioners alike to compare methodologies and build more reliable financial applications. I believe the role of such a database would be akin to that of MNIST or CIFAR in image classification: to serve as an essential reference for evaluating the performance of any given approach, making it easier for practitioners to recognise promising approaches and for regulators to understand the quality of models that might otherwise be opaque. A standardized dataset of this kind has the potential to accelerate development and simplify comparisons between models, both of which will contribute to more robust and reliable applications in practice.
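To make the contrast with "one path history" concrete, here is a minimal sketch of how a synthetic benchmark can offer many reproducible price paths instead of a single one. It is only an illustration, not the design of the proposed dataset: geometric Brownian motion is assumed purely for simplicity, and the function name `synthetic_price_paths` and all parameter values (`s0`, `mu`, `sigma`, `dt`, `seed`) are hypothetical choices.

```python
import math
import random


def synthetic_price_paths(n_paths, n_steps, s0=100.0, mu=0.05, sigma=0.2,
                          dt=1.0 / 252, seed=0):
    """Simulate price paths under geometric Brownian motion.

    GBM and every default value here are illustrative assumptions,
    not properties of any actual dataset.
    """
    # A fixed seed means anyone can regenerate exactly the same paths.
    rng = random.Random(seed)
    drift = (mu - 0.5 * sigma ** 2) * dt  # per-step drift of the log-price
    vol = sigma * math.sqrt(dt)           # per-step volatility of the log-price
    paths = []
    for _ in range(n_paths):
        path = [s0]
        for _ in range(n_steps):
            shock = rng.gauss(0.0, 1.0)   # standard normal increment
            path.append(path[-1] * math.exp(drift + vol * shock))
        paths.append(path)
    return paths


# One year of daily prices, repeated over 1000 independent paths.
paths = synthetic_price_paths(n_paths=1000, n_steps=252)
```

Because the generating model and the seed are both known, two teams evaluating different methods on `paths` work with identical data, and robustness can be assessed across all paths rather than on a single realised history.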

This is why some other members of our group and I have started the process of designing such a dataset. Certainly, there are several challenges for this proposal to overcome: theoretical ones, such as deciding what properties are desirable for such a dataset, but also practical ones related to size and cost, which will arise once the implementation stage is reached. However, we believe that this initiative will significantly benefit the entire quantitative finance community by fostering transparency, reproducibility and, ultimately, more robust and impactful financial applications.

We welcome your ideas and support in making this vision a reality. Do not hesitate to contact us to ask questions, provide feedback, or propose ways to support the initiative.

Dr Camilo Garcia Trillos

camilo.garcia@ucl.ac.uk