AI Summary
This work addresses the significant performance degradation of multi-turn dialogue systems in deployment, caused by contextual distribution shifts between training and real-world interactions, a gap that accumulates quadratically with conversation length. The study is the first to systematically distinguish between policy-induced and simulator-induced distribution shifts and proposes a calibration-aware interactive reinforcement learning framework. By incorporating a simulator alignment mechanism that better approximates human interaction patterns, the approach jointly mitigates both types of shift. Integrating interactive reinforcement learning with LLM-based policy optimization, the method substantially outperforms static-context baselines across multiple dialogue tasks. Furthermore, simulator calibration effectively narrows the reality gap between simulation and actual deployment, achieving state-of-the-art downstream performance.
Abstract
A long-standing goal of the research community is to develop highly interactive LLM-based dialogue agents. Recent research optimizes policies either on fixed offline logs (Static Context RL) or against a prompt-based simulator (Interactive RL). In this work, we theoretically show that both paradigms are fundamentally limited by context distribution shift, a mismatch between the dialogue histories observed during training and those encountered in real conversations. This shift compounds quadratically over turns and severely degrades dialogue quality. Specifically, we attribute this shift to two distinct sources: (i) policy-induced shift, arising from training on static histories rather than self-generated trajectories; and (ii) simulator-induced shift, stemming from discrepancies between simulated and real human behaviors. To address these challenges, we propose Calibrated Interactive RL, a unified framework that couples interactive RL with simulator alignment. By aligning the simulator with human interaction patterns, our approach reduces the sim-to-real gap and mitigates compounding distribution shifts. Experiments across multiple dialogue tasks confirm our theoretical analysis: (i) Interactive RL significantly outperforms the Static Context baseline by mitigating policy distribution shift; and (ii) calibrating simulators with our alignment method further bridges the sim-to-real gap, yielding state-of-the-art downstream performance.
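To give some intuition for why a per-turn mismatch can compound quadratically, the following is a minimal Python sketch of a toy model. It is not the paper's theorem or proof, which the abstract does not spell out. The model assumes that at each turn the dialogue context drifts off the training distribution with probability `eps`, and that once it has drifted it stays off-distribution and costs one unit of quality per turn. The function names `expected_off_distribution_turns` and `quadratic_bound` are hypothetical and chosen only for illustration.

```python
def expected_off_distribution_turns(eps: float, horizon: int) -> float:
    """Expected number of turns spent off-distribution in the toy model.

    Assumption (illustrative only): at every turn the context leaves the
    training distribution with probability eps and never returns.
    """
    total = 0.0
    p_on = 1.0  # probability the context is still in-distribution
    for _ in range(horizon):
        p_on *= 1.0 - eps      # survive one more turn without drifting
        total += 1.0 - p_on    # probability this turn is off-distribution
    return total


def quadratic_bound(eps: float, horizon: int) -> float:
    """Upper bound eps * T * (T + 1) / 2, from 1 - (1 - eps)**t <= t * eps."""
    return eps * horizon * (horizon + 1) / 2


if __name__ == "__main__":
    for turns in (5, 10, 20, 40):
        exact = expected_off_distribution_turns(0.01, turns)
        bound = quadratic_bound(0.01, turns)
        print(f"T={turns:>2}  expected cost={exact:.4f}  bound={bound:.4f}")
```

In this toy setting, doubling the number of turns roughly quadruples the expected cost when `eps` is small, which is the sense in which a small per-turn shift grows with the square of conversation length. Reducing `eps`, whether by training on self-generated trajectories or by using a better-aligned simulator, shrinks the whole bound proportionally.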