Choice-Model-Assisted Q-learning for Delayed-Feedback Revenue Management

📅 2026-02-02

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This study addresses delayed feedback in revenue management, where much of a booking's value depends on cancellations or modifications observed days after the booking is made. The authors integrate a calibrated discrete choice model into a reinforcement learning framework as a fixed partial world model, which imputes the delayed component of the reward at decision time so that Q-learning can proceed without waiting for outcomes to mature. They prove that, with the model held fixed, tabular Q-learning with model-imputed targets converges to a neighborhood of the optimal Q-function whose size scales with the partial-model error, and they characterize the trade-off between robustness and bias under distribution shift. In a simulator calibrated from 61,619 hotel bookings, the method shows no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings, significantly improves revenue in 5 of 10 parameter-shift scenarios (by up to 12.4%), and loses 1.4–2.6% of revenue when the choice model's structure is misspecified.
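The imputation idea summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the scalar `cancel_prob` (standing in for the output of a fixed, pre-calibrated choice model), and the toy states and actions are all assumptions made for illustration. The point is only that the delayed part of the reward is replaced by a model-based expectation so the Q-update can run at booking time.

```python
from collections import defaultdict


def imputed_reward(price, cancel_prob):
    """Expected revenue at booking time (hypothetical helper).

    cancel_prob is assumed to come from a fixed choice model; the revenue
    is kept only if the booking is not cancelled later.
    """
    return price * (1.0 - cancel_prob)


def q_update(Q, state, action, reward, next_state, actions, alpha=0.1, gamma=0.95):
    """One tabular Q-learning step using a (model-imputed) reward."""
    best_next = max(Q[(next_state, a)] for a in actions)
    td_target = reward + gamma * best_next
    Q[(state, action)] += alpha * (td_target - Q[(state, action)])
    return Q[(state, action)]


# Toy usage: one update at decision time, without waiting for the
# cancellation outcome to be observed.
Q = defaultdict(float)
actions = ["low_price", "high_price"]
r_hat = imputed_reward(price=100.0, cancel_prob=0.2)
q_update(Q, state=0, action="high_price", reward=r_hat,
         next_state=1, actions=actions, alpha=0.5, gamma=0.9)
```

If the assumed cancellation probability is biased, every target inherits that bias, which is the intuition behind the error-dependent neighborhood in the convergence result.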

📝 Abstract

We study reinforcement learning for revenue management with delayed feedback, where a substantial fraction of value is determined by customer cancellations and modifications observed days after booking. We propose choice-model-assisted RL: a calibrated discrete choice model is used as a fixed partial world model to impute the delayed component of the learning target at decision time. In the fixed-model deployment regime, we prove that tabular Q-learning with model-imputed targets converges to an $O(\varepsilon/(1-\gamma))$ neighborhood of the optimal Q-function, where $\varepsilon$ summarizes partial-model error, with an additional $O(t^{-1/2})$ sampling term. Experiments in a simulator calibrated from 61,619 hotel bookings (1,088 independent runs) show: (i) no statistically detectable difference from a maturity-buffer DQN baseline in stationary settings; (ii) positive effects under in-family parameter shifts, with significant gains in 5 of 10 shift scenarios after Holm–Bonferroni correction (up to 12.4%); and (iii) consistent degradation under structural misspecification, where the choice model assumptions are violated (1.4–2.6% lower revenue). These results characterize when partial behavioral models improve robustness under shift and when they introduce harmful bias.
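For readers unfamiliar with the Holm–Bonferroni correction mentioned in the abstract, the standard step-down procedure can be sketched as follows. The function name and the example p-values are hypothetical and are not taken from the paper; the sketch only shows how a family of tests (such as the 10 shift scenarios) is screened while controlling the family-wise error rate.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Return a list of booleans: True where the hypothesis is rejected.

    Step-down rule: sort p-values ascending and compare the k-th smallest
    (k = 0, 1, ...) against alpha / (m - k); stop at the first failure.
    """
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # all larger p-values are also not rejected
    return reject


# Made-up p-values for three comparisons (illustration only).
example = holm_bonferroni([0.01, 0.04, 0.03], alpha=0.05)
```

In this made-up example only the first comparison survives correction, because the second-smallest p-value (0.03) exceeds its threshold of 0.025 and the procedure stops there.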

Problem

Research questions and friction points this paper is trying to address.

delayed feedback

revenue management

reinforcement learning

customer cancellations

booking modifications

Innovation

Methods, ideas, or system contributions that make the work stand out.

choice-model-assisted RL

delayed feedback

revenue management