🤖 AI Summary
Financial generative models lack a unified, quantitative evaluation paradigm, particularly for message-level generation of limit order book (LOB) data. This paper introduces LOB-Bench, the first dedicated benchmark for evaluating generative AI in LOB modeling. It enables multidimensional quantitative assessment across distributional statistics, market microstructure properties, and event-driven market impact. It defines consistency metrics for conditional and unconditional statistics of data in the LOBSTER format and introduces market-impact measures, including price response functions and event cross-correlations. The framework combines multivariate statistical tests, discriminator-based scoring, and event-driven modeling, and is implemented in Python. Empirical results show that autoregressive generative models significantly outperform traditional parametric models and (C)GANs in both statistical fidelity and market-dynamic realism, establishing a standardized evaluation foundation for financial generative modeling.
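To make the idea of a distributional comparison concrete, the sketch below scores how far a generated sample of some scalar LOB statistic (for example, the spread) lies from the real sample. It is a minimal illustration, not LOB-Bench's own code: the function names `l1_histogram_distance` and `wasserstein_1`, the binning scheme, and the choice of these two distances are assumptions made here for the example.

```python
def l1_histogram_distance(real, generated, n_bins=10):
    """Half the L1 distance between binned empirical distributions.

    Returns a value in [0, 1]: 0 for identical histograms, 1 for
    histograms with no overlapping mass. (Hypothetical helper.)
    """
    lo = min(min(real), min(generated))
    hi = max(max(real), max(generated))
    if hi == lo:
        return 0.0  # every value is identical, so the distributions match
    width = (hi - lo) / n_bins

    def hist(xs):
        counts = [0] * n_bins
        for x in xs:
            # Clamp so the maximum value falls into the last bin.
            idx = min(int((x - lo) / width), n_bins - 1)
            counts[idx] += 1
        return [c / len(xs) for c in counts]

    p, q = hist(real), hist(generated)
    return 0.5 * sum(abs(a - b) for a, b in zip(p, q))


def wasserstein_1(real, generated):
    """Wasserstein-1 distance for two samples of equal size.

    For equal-size samples this is the mean absolute difference
    between the sorted values. (Hypothetical helper.)
    """
    if len(real) != len(generated):
        raise ValueError("this sketch assumes equal sample sizes")
    pairs = zip(sorted(real), sorted(generated))
    return sum(abs(a - b) for a, b in pairs) / len(real)
```

For a conditional statistic, the same distance could be computed separately within each bucket of the conditioning variable (say, spread conditioned on time of day) and then aggregated.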
📝 Abstract
While financial data presents one of the most challenging and interesting sequence modelling tasks due to high noise, heavy tails, and strategic interactions, progress in this area has been hindered by the lack of consensus on quantitative evaluation paradigms. To address this, we present LOB-Bench, a benchmark, implemented in Python, designed to evaluate the quality and realism of generative message-by-order data for limit order books (LOB) in the LOBSTER format. Our framework measures distributional differences in conditional and unconditional statistics between generated and real LOB data, supporting flexible multivariate statistical evaluation. The benchmark also includes commonly used LOB statistics such as spread, order book volumes, order imbalance, and message inter-arrival times, along with scores from a trained discriminator network. Lastly, LOB-Bench contains "market impact metrics", i.e., the cross-correlations and price response functions for specific events in the data. We benchmark generative autoregressive state-space models, a (C)GAN, as well as a parametric LOB model, and find that the autoregressive GenAI approach beats traditional model classes.
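The statistics named in the abstract can be illustrated with a short sketch. The definitions below are common textbook forms, assumed here for illustration rather than taken from the LOB-Bench code: the spread as best ask minus best bid, order imbalance as the normalized difference of best bid and ask volumes, inter-arrival times as differences of consecutive timestamps, and the price response at lag `lag` as the average signed mid-price change after an event. All function names and input layouts are hypothetical.

```python
def spread(best_bid, best_ask):
    """Bid-ask spread in the same units as the prices (e.g. ticks)."""
    return best_ask - best_bid


def order_imbalance(bid_volume, ask_volume):
    """Normalized volume imbalance in [-1, 1]; positive means more bid volume."""
    return (bid_volume - ask_volume) / (bid_volume + ask_volume)


def inter_arrival_times(timestamps):
    """Differences between consecutive message timestamps."""
    return [b - a for a, b in zip(timestamps, timestamps[1:])]


def price_response(mid_prices, signs, lag):
    """Average signed mid-price change `lag` steps after an event.

    Assumed definition: R(lag) = mean over events t of
    signs[t] * (mid_prices[t + lag] - mid_prices[t]),
    where signs[t] is +1 for a buy-side event, -1 for a sell-side
    event, and 0 where no event of interest occurred.
    """
    terms = [
        s * (mid_prices[t + lag] - mid_prices[t])
        for t, s in enumerate(signs)
        if s != 0 and t + lag < len(mid_prices)
    ]
    if not terms:
        raise ValueError("no events with a full look-ahead window")
    return sum(terms) / len(terms)
```

Under these assumptions, evaluating `price_response` over a range of lags on both real and generated data gives two response curves whose shapes can be compared directly.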