🤖 AI Summary
This paper addresses three core challenges in modeling financial limit order book (LOB) message streams: irregular event timing, rapid regime shifts, and high-frequency traders' dynamic responses to visible order flow. To this end, the authors propose LOBERT, presented as the first general-purpose foundation encoder designed for message-level LOB modeling. Adapting the BERT architecture to LOB data, they introduce a multi-dimensional tokenization scheme that encodes each message (price, size, and timestamp) as a single token while preserving temporal and numerical structure through continuous embeddings. Built on a Transformer encoder, the model operates natively on asynchronous event sequences without discretization or fixed-length windows. Evaluated on mid-price movement prediction and next-message classification, it achieves state-of-the-art performance while reducing the required context length by over 50% relative to prior approaches, improving computational efficiency and out-of-distribution generalization.
📝 Abstract
Modeling the dynamics of financial Limit Order Books (LOB) at the message level is challenging due to irregular event timing, rapid regime shifts, and the reactions of high-frequency traders to visible order flow. Previous LOB models require cumbersome data representations and lack adaptability outside their original tasks, leading us to introduce LOBERT, a general-purpose encoder-only foundation model for LOB data suitable for downstream fine-tuning. LOBERT adapts the original BERT architecture for LOB data by using a novel tokenization scheme that treats complete multi-dimensional messages as single tokens while retaining continuous representations of price, volume, and time. With these methods, LOBERT achieves leading performance in tasks such as predicting mid-price movements and next messages, while reducing the required context length compared to previous methods.
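The core tokenization idea, one token per LOB message with continuous embeddings for its numeric fields rather than a discretized vocabulary, can be sketched as follows. This is a minimal illustrative assumption of how such a scheme might look: the field set, log scaling, and additive combination are hypothetical choices, not the paper's exact design, and the projection vectors stand in for learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
D = 16  # token embedding dimension (illustrative)

# Hypothetical learned parameters: one projection vector per continuous
# field, plus a lookup embedding per discrete event type.
W_price, W_size, W_dt = (rng.normal(size=D) for _ in range(3))
event_types = {"limit": 0, "cancel": 1, "market": 2}
E_type = rng.normal(size=(len(event_types), D))

def tokenize_message(event_type: str, price: float, size: float, dt: float) -> np.ndarray:
    """Map one LOB message to a single D-dimensional token embedding.

    Continuous fields (price, size, inter-arrival time) are kept as real
    values, rescaled, and projected, instead of being bucketed into a
    discrete vocabulary; the event type contributes a lookup embedding.
    """
    z_price = np.log1p(price) * W_price  # log scaling tames heavy-tailed values
    z_size = np.log1p(size) * W_size
    z_dt = np.log1p(dt) * W_dt           # dt: seconds since previous message
    return E_type[event_types[event_type]] + z_price + z_size + z_dt

# One message -> one token, so a sequence of N messages yields an (N, D)
# input for a standard Transformer encoder, with no fixed-length windowing.
token = tokenize_message("limit", price=101.25, size=300.0, dt=0.004)
print(token.shape)
```

Because irregular inter-arrival times enter through a continuous embedding rather than resampling onto a fixed clock, the encoder can consume the raw asynchronous event stream directly.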