Why Machine Learning Quants Need “Golden” Datasets

Today’s computers can tell apart all sorts of everyday things – cats and dogs, fire hydrants and traffic lights – because people have painstakingly cataloged 14 million images of them, by hand, for computers to learn from. Quants think finance needs something similar.

The tagged images used to train and test image recognition algorithms reside in a publicly available database called ImageNet. It has been essential to improving these algorithms: developers can gauge their progress by their success rate in correctly categorizing ImageNet images.

Without ImageNet, it would be much harder to tell if one model beats another.

Finance is no different. Like all machine learning models, those used to invest or hedge reflect the data from which they learned. Comparing models trained on different data can therefore tell you a lot about the data, but much less about the models themselves.

Measuring a company’s machine learning model against other known models in the industry, or even against different models from the same organization, becomes nearly impossible.

So the idea is to create shared datasets that quants could use to weigh models against each other. In finance, however, this is a more complex task than simply collecting and labeling images.

First, banks and investment firms are reluctant to share proprietary data, sometimes for privacy reasons, often because the data has too much commercial value. Such reluctance can make gathering raw information for benchmark datasets a challenge from the start.

Second, new “golden” datasets would need masses of data covering all market scenarios – including scenarios that have never actually happened in history.

This is a well-known problem affecting machine learning models trained on historical data. In financial markets, the future rarely resembles the past.

“If the dataset on which you train your model resembles the data or scenarios it encounters in real life, you’re in business,” says Blanka Horvath, professor of mathematical finance at the Technical University of Munich. “If it’s significantly different, you don’t know what the model is going to do.”

The solution to both problems, the quants think, might be to create some of the reference data themselves.

Horvath, with a team at TUM’s Data Science Institute, has launched a project called SyBenDaFin – synthetic benchmark datasets for finance – to achieve this.

The plan is to formulate benchmark datasets that reflect what has happened in the markets in the past, but also what could have happened, even if it did not.

Synthesizing data in this way is increasingly common in finance. In another project, for example, Horvath tested machine learning deep hedging engines by training a model on synthetic data and comparing its output to a conventional hedging approach.
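To make the idea concrete, here is a minimal sketch of how a shared synthetic dataset lets two hedging approaches be scored on identical scenarios. It is not the SyBenDaFin method or Horvath's deep hedging setup. Everything in it is an assumption for illustration: the price paths come from a simple geometric Brownian motion with zero interest rates, the "conventional" approach is Black-Scholes delta hedging, the alternative is a trivial no-hedge stand-in for whatever model is being tested, and the function names (`make_benchmark_paths`, `hedging_pnl`, `score`) are hypothetical.

```python
import math
import random
import statistics


def norm_cdf(x):
    """Standard normal CDF via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def _d1(s, k, sigma, tau):
    # Black-Scholes d1 with zero interest rate (an assumption of this sketch).
    return (math.log(s / k) + 0.5 * sigma**2 * tau) / (sigma * math.sqrt(tau))


def bs_call_price(s, k, sigma, tau):
    d1 = _d1(s, k, sigma, tau)
    return s * norm_cdf(d1) - k * norm_cdf(d1 - sigma * math.sqrt(tau))


def bs_call_delta(s, k, sigma, tau):
    """Conventional hedge: hold the Black-Scholes delta in the underlying."""
    return norm_cdf(_d1(s, k, sigma, tau))


def no_hedge(s, k, sigma, tau):
    """Stand-in for an alternative model; here it simply never hedges."""
    return 0.0


def make_benchmark_paths(n_paths, n_steps, s0, sigma, horizon, seed):
    """Synthetic benchmark: driftless GBM paths, reproducible from the seed."""
    rng = random.Random(seed)
    dt = horizon / n_steps
    paths = []
    for _ in range(n_paths):
        s, path = s0, [s0]
        for _ in range(n_steps):
            z = rng.gauss(0.0, 1.0)
            s *= math.exp(-0.5 * sigma**2 * dt + sigma * math.sqrt(dt) * z)
            path.append(s)
        paths.append(path)
    return paths


def hedging_pnl(path, strategy, k, sigma, horizon):
    """P&L of selling one call at the model price and hedging with `strategy`."""
    n_steps = len(path) - 1
    dt = horizon / n_steps
    pnl = bs_call_price(path[0], k, sigma, horizon)  # premium received
    for i in range(n_steps):
        position = strategy(path[i], k, sigma, horizon - i * dt)
        pnl += position * (path[i + 1] - path[i])  # gain on the hedge
    return pnl - max(path[-1] - k, 0.0)  # pay out the option at expiry


def score(paths, strategy, k, sigma, horizon):
    """Benchmark metric: standard deviation of hedging P&L (lower is better)."""
    return statistics.pstdev(hedging_pnl(p, strategy, k, sigma, horizon) for p in paths)


# Both strategies are scored on exactly the same scenarios.
paths = make_benchmark_paths(n_paths=500, n_steps=50, s0=100.0,
                             sigma=0.2, horizon=0.25, seed=7)
print("delta hedge:", score(paths, bs_call_delta, 100.0, 0.2, 0.25))
print("no hedge:   ", score(paths, no_hedge, 100.0, 0.2, 0.25))
```

Because the seed fixes the scenarios, any difference in the two scores can be attributed to the strategies rather than to the data, which is the point of a shared benchmark. A realistic benchmark would need far richer scenario generators than the single-volatility process assumed here.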

Quants say it would be too complex to formulate a universal dataset comparable to ImageNet for all types of financial models.

Market patterns that would test a model that rebalances every few seconds, for example, would differ from events that would challenge a model trading on a monthly horizon.

Instead, the idea would be to create multiple datasets, each designed to test models created for a specific use.

Benchmarks could help practitioners understand the strengths and weaknesses of models as well as determine whether or not changes to a model lead to improvement.

Regulators, too, could benefit. Potentially, they could train a model on the benchmark data and see how it performs against the same model trained on a firm’s internal data.

In a paper published last year, authors from the Alan Turing Institute and the Universities of Edinburgh and Oxford said the industry today had little knowledge about the suitability or optimality of different machine learning methods in different cases. A “clear opportunity” exists for finance to use synthetic data generators in benchmarking, they wrote.

“Companies are increasingly relying on black box algorithms and methods,” says Sam Cohen, one of the authors and associate professor at the Mathematical Institute at the University of Oxford and the Alan Turing Institute. “It’s a way to check our understanding of what they’re actually going to do.”
