🤖 AI Summary
In home environments, long-horizon tasks are often under-specified: explicit user preferences are unavailable, which hinders preference-aligned execution. Method: We propose an active-questioning paradigm for preference-adaptive task execution and introduce ADAPT, the first benchmark that supports preference identification via active questioning. We further design Reflection-DPO, a training framework that jointly models the three-stage policy ("when to ask," "what to ask," and "how to execute") by combining reflective reasoning with teacher-student distillation. Contribution/Results: On ADAPT, Reflection-DPO improves the satisfaction rate on unseen user preferences by 6.1% (absolute) over zero-shot chain-of-thought baselines. This work provides the first systematic validation of active, interactive preference learning, demonstrating both efficacy and scalability for long-horizon task execution under implicit preferences.
📝 Abstract
Assistive agents should be able to perform under-specified long-horizon tasks while respecting user preferences. We introduce Actively Discovering and Adapting to Preferences for any Task (ADAPT) -- a benchmark designed to evaluate agents' ability to adhere to user preferences across various household tasks through active questioning. Next, we propose Reflection-DPO, a novel training approach for adapting large language models (LLMs) to the task of active questioning. Reflection-DPO finetunes a 'student' LLM to follow the actions of a privileged 'teacher' LLM, optionally asking a question to gather the information needed to better predict the teacher's action. We find that prior approaches using state-of-the-art LLMs fail to sufficiently follow user preferences in ADAPT, due to insufficient questioning and poor adherence to elicited preferences. In contrast, Reflection-DPO achieves a higher rate of satisfying user preferences, outperforming a zero-shot chain-of-thought baseline by 6.1% on unseen users.
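To make the "-DPO" part of Reflection-DPO concrete: the method builds on Direct Preference Optimization, whose standard per-pair loss is shown in the sketch below. The pairing of a teacher-aligned action as "chosen" against a student rollout as "rejected" is an illustrative assumption here, not the paper's exact construction; the loss itself is the standard DPO objective over log-probabilities under the policy and a frozen reference model.

```python
import math

def dpo_loss(pol_chosen: float, pol_rejected: float,
             ref_chosen: float, ref_rejected: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair, given sequence
    log-probabilities under the trained policy and a frozen reference.

    In a Reflection-DPO-style setup (an assumption for illustration),
    'chosen' would be the privileged teacher's action, possibly preceded
    by a clarifying question, and 'rejected' a student action that
    ignored the user's preference.
    """
    # Implicit reward margin: how much more the policy prefers the
    # chosen response over the rejected one, relative to the reference.
    margin = (pol_chosen - ref_chosen) - (pol_rejected - ref_rejected)
    # Negative log-sigmoid of the scaled margin.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# With zero margin the loss is ln(2); as the policy shifts probability
# mass toward the chosen (teacher-aligned) action, the loss decreases.
baseline = dpo_loss(-2.0, -2.0, -2.0, -2.0)   # margin = 0
improved = dpo_loss(-1.0, -5.0, -2.0, -2.0)   # margin = 4
```

Minimizing this loss pushes the student toward teacher-aligned behavior without an explicit reward model, which is what makes DPO a natural fit for distilling a privileged teacher.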