🤖 AI Summary
This study addresses the online market-making problem under action-dependent feedback, where what the learner observes depends on whether the posted quotes result in a trade: when no trade occurs, the learner observes supply–demand information from the limit order book; when a trade occurs, only the fact of execution is revealed, and the counterparty's valuation stays hidden. The work introduces this action-dependent feedback model into the online market-making framework and shows that the extra information improves learnability. The authors propose an elimination-based algorithm for i.i.d. market prices that needs no smoothness assumptions on the valuation distribution, extend the result to mean-reverting price processes under either local autoregressive dynamics or a weaker global drift condition, and design an explore-then-perturb algorithm for adversarial settings with oblivious prices. The approach achieves a high-probability $O(\sqrt{T})$ regret bound in the stochastic settings and an expected $O(T^{2/3})$ regret bound in the adversarial one, improving on the guarantees available under standard bandit feedback models.
📝 Abstract
We study an online market-making problem in which a learner sequentially posts bid and ask prices for a single asset while interacting with traders holding private valuations. Unlike existing online learning formulations that assume fully censored feedback, we introduce an action-dependent feedback model inspired by real limit order books: when a trade occurs, the trader's valuation remains hidden, whereas when no trade occurs, informative feedback about supply and demand is revealed.
We show that this additional information fundamentally changes the learnability of the problem. In the stochastic setting with i.i.d. market prices, we propose an elimination-based algorithm that achieves $O(\sqrt T)$ regret with high probability, without requiring any smoothness assumptions on the distribution of trader valuations. We then extend this result to a broad class of mean-reverting price processes by considering both local, autoregressive dynamics and a weaker global drift condition based on cumulative deviations from the mean. Under either assumption, we establish high-probability $O(\sqrt T)$ regret bounds, relying on a new concentration inequality of independent interest. Finally, in the adversarial setting with oblivious prices, we design an explore-then-perturb algorithm that guarantees $O(T^{2/3})$ regret in expectation.
Our results quantify the value of observing the order book in online market making and demonstrate that even limited, action-dependent feedback can substantially improve regret guarantees compared to standard bandit feedback models.
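To make the action-dependent feedback model concrete, the following is a minimal Python sketch of a single round. It is an illustration under stated assumptions, not the paper's formal definition: the names `Feedback` and `post_quotes` are hypothetical, the tie-breaking rule (a trade happens when the valuation is at or beyond a quote) is assumed, and the "informative feedback about supply and demand" in the no-trade case is modeled here simply as revealing the trader's valuation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Feedback:
    """What the learner sees after one round (hypothetical structure)."""
    traded: bool
    # Assumption: order-book feedback is modeled as the trader's valuation,
    # and it is only available when no trade occurs.
    revealed_valuation: Optional[float]


def post_quotes(bid: float, ask: float, valuation: float) -> Feedback:
    """Simulate one round of the assumed action-dependent feedback model."""
    if bid > ask:
        raise ValueError("bid must not exceed ask")
    # Assumed trade rule: the trader buys at the ask if the valuation is at
    # least the ask, and sells at the bid if the valuation is at most the bid.
    if valuation >= ask or valuation <= bid:
        # A trade occurred: only the fact of execution is revealed.
        return Feedback(traded=True, revealed_valuation=None)
    # No trade: the valuation lies strictly inside the spread and is revealed.
    return Feedback(traded=False, revealed_valuation=valuation)


if __name__ == "__main__":
    print(post_quotes(bid=9.0, ask=11.0, valuation=12.0))  # trade, censored
    print(post_quotes(bid=9.0, ask=11.0, valuation=10.0))  # no trade, revealed
```

In this sketch a wider spread yields more informative rounds but fewer trades, which is the kind of trade-off an action-dependent feedback model introduces; how the paper's algorithms exploit it is not specified here.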