Controllable User Simulation

📅 2026-05-12

🤖 AI Summary

This work addresses a limitation of existing controllable user simulators: supervised fine-tuning on post-hoc trajectory labels introduces look-ahead bias, which breaks causal consistency and sharply inflates evaluation variance under policy shift, undermining the simulators' ability to test novel dialogue policies. The authors formalize controllable simulation as a causal inference problem, identify a “controllability collapse” phenomenon in the standard approach, and establish theoretical conditions for restoring causal consistency. They propose three training strategies (a priori controls, step-wise dynamic controls, and direct policy-conditioned learning) that eliminate look-ahead bias while preserving naturalness and behavioral diversity in simulated dialogues. Empirical results show that the resulting simulators exhibit robust zero-shot generalization to unseen agent behaviors while preserving natural variance.

📝 Abstract

Using offline datasets to evaluate conversational agents often fails to cover rare scenarios or to support testing new policies. This has motivated the use of controllable user simulators for targeted, counterfactual evaluation, typically implemented by prompting or fine-tuning large language models. In this work, we formalize controllable simulation as a causal inference problem. By bridging natural language evaluation with off-policy evaluation methodology, we show that the standard practice of training simulators via supervised fine-tuning on post-hoc trajectory labels yields a structurally biased model. Specifically, these labels are inextricably coupled to the data-generating behavior policy, injecting a look-ahead bias that breaks causal consistency. Furthermore, we prove that under policy shift this failure causes the variance of evaluation metrics to explode geometrically, a phenomenon we term controllability collapse. To restore causal consistency, we establish theoretical conditions for accurate simulation and propose practical training mitigations: a priori controls, step-wise dynamic controls, and direct policy-conditioned learning. Empirical evaluation confirms that while standard global controls distort conversational distributions and collapse behavioral diversity, our causally grounded simulators eliminate look-ahead bias, preserve natural variance, and exhibit robust zero-shot generalization to unseen agent behaviors.
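The geometric growth in variance can be made concrete with the textbook importance-sampling picture from off-policy evaluation. The sketch below is only an illustration of that general intuition, not the paper's proof or its simulator setup. It assumes each dialogue turn is an independent draw from a fixed discrete action distribution, and the names `second_moment`, `weight_variance`, `behavior`, and `target` are hypothetical. Under those assumptions, the trajectory-level likelihood ratio is a product of per-turn ratios, so its variance equals the per-turn second moment raised to the horizon, minus one.

```python
# Toy illustration (assumption: i.i.d. turns, discrete actions).
# Not the paper's construction; all names here are hypothetical.

def second_moment(target, behavior):
    """E[w^2] for the per-turn ratio w = target(a) / behavior(a), a ~ behavior."""
    return sum(p * p / q for p, q in zip(target, behavior))

def weight_variance(target, behavior, horizon):
    """Variance of the product of `horizon` independent per-turn ratios.

    Each ratio has mean 1, so Var(W) = E[w^2] ** horizon - 1.
    """
    return second_moment(target, behavior) ** horizon - 1.0

behavior = [0.5, 0.5]  # data-generating (behavior) policy over two actions
target = [0.8, 0.2]    # shifted policy we would like to evaluate

variances = {t: weight_variance(target, behavior, t) for t in (1, 5, 10, 20)}

for horizon, var in variances.items():
    print(f"horizon={horizon:2d}  variance of trajectory weight={var:.2f}")
```

With no policy shift (`target == behavior`) the per-turn second moment is 1 and the variance stays at zero for every horizon; any shift pushes the second moment above 1, and the variance then grows geometrically with the number of turns.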

Problem

Research questions and friction points this paper is trying to address.

controllable user simulation

causal inference

look-ahead bias

off-policy evaluation

policy shift

Innovation

Methods, ideas, or system contributions that make the work stand out.

controllable user simulation

causal inference

off-policy evaluation