🤖 AI Summary
This study addresses delayed feedback in revenue management, where much of a booking's value is determined by cancellations or modifications observed days later. The authors propose integrating a calibrated discrete choice model as a fixed partial world model within a reinforcement learning framework, so that the delayed component of the reward can be imputed at decision time and used in the Q-learning target. For the fixed-model regime, they prove that tabular Q-learning with imputed targets converges to an $O(\varepsilon/(1-\gamma))$ neighborhood of the optimal Q-function, where $\varepsilon$ summarizes the partial model's error, and they characterize when the partial model improves robustness under distribution shift and when it introduces harmful bias. In a simulator calibrated from 61,619 hotel bookings (1,088 independent runs), the method shows no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings, yields significant revenue gains in 5 of 10 in-family parameter-shift scenarios after Holm-Bonferroni correction (up to 12.4%), and loses 1.4–2.6% revenue when the choice model's structure is misspecified.
📝 Abstract
We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose \emph{choice-model-assisted RL}: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed-model deployment regime, we prove that tabular Q-learning with model-imputed targets converges to an $O(\varepsilon/(1-\gamma))$ neighborhood of the optimal Q-function, where $\varepsilon$ summarizes partial-model error, with an additional $O(t^{-1/2})$ sampling term. Experiments in a simulator calibrated from 61{,}619 hotel bookings (1{,}088 independent runs) show: (i) no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings; (ii) positive effects under in-family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm--Bonferroni correction (up to 12.4\%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4--2.6\% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.
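The imputation idea in the abstract can be illustrated with a minimal tabular sketch. This is not the authors' implementation: the one-state toy environment, the prices, the booking probabilities, the estimated cancellation probabilities `p_cancel_hat`, the assumption that a cancellation refunds the full price, and the step-size schedule are all hypothetical choices made for illustration. The learner only sees whether a booking occurred; the delayed cancellation outcome is never observed and is replaced in the target by its expectation under the fixed model.

```python
import numpy as np


def imputed_target(immediate, p_cancel_hat, refund, gamma, q_next):
    """TD target whose delayed part is imputed by a fixed model.

    immediate    : revenue observed at decision time
    p_cancel_hat : model-estimated cancellation probability (assumption)
    refund       : revenue lost if the booking is later cancelled
    q_next       : Q-values of the next state
    """
    expected_delayed = -p_cancel_hat * refund
    return immediate + expected_delayed + gamma * float(np.max(q_next))


def run_toy(seed=0, n_steps=20_000, gamma=0.5):
    """Tabular Q-learning on a hypothetical one-state, two-price problem."""
    rng = np.random.default_rng(seed)
    prices = np.array([80.0, 120.0])        # hypothetical price actions
    p_book = np.array([0.9, 0.6])           # hypothetical booking probabilities
    p_cancel_hat = np.array([0.12, 0.28])   # fixed model estimates (assumption)

    Q = np.zeros((1, 2))
    visits = np.zeros((1, 2))
    s = 0  # single state, so the next state is always s
    for _ in range(n_steps):
        a = int(rng.integers(2))            # uniform exploration (off-policy)
        booked = rng.random() < p_book[a]
        immediate = prices[a] if booked else 0.0
        # The true cancellation outcome is not available here; impute it.
        target = imputed_target(immediate, p_cancel_hat[a], immediate, gamma, Q[s])
        visits[s, a] += 1
        alpha = visits[s, a] ** -0.7        # decaying step size (illustrative)
        Q[s, a] += alpha * (target - Q[s, a])
    return Q


if __name__ == "__main__":
    print(run_toy())
```

In this sketch the learned Q-function approaches the fixed point of the model-imputed Bellman operator, not the true optimum; the gap between the two is the kind of bias that the $O(\varepsilon/(1-\gamma))$ term in the abstract bounds when `p_cancel_hat` differs from the true cancellation probabilities.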