🤖 AI Summary
This work addresses cross-domain task-oriented dialogue, which requires reasoning over both explicit and implicit feasibility constraints across long-horizon, multi-turn interactions. Large language models (LLMs) can infer such constraints but become unreliable over long horizons, while reinforcement learning (RL) cannot recover structured constraints directly from raw dialogue. The paper proposes VLK-RL (Verified LLM-Knowledge empowered RL), a hybrid framework that first uses an LLM to generate candidate constraints and then applies a dual-role cross-examination procedure to filter out hallucinated or cross-turn-inconsistent outputs. The verified constraints are mapped into ontology-aligned slot-value representations that form a structured, constraint-aware state for RL policy optimization. By verifying LLM-generated constraints and converting them into symbolic form, the approach connects constraint reasoning with policy learning; the authors report significantly improved generalization and robustness across multiple benchmarks, outperforming strong single-model baselines on long-horizon tasks.
📝 Abstract
Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, while reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot-value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.
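The three stages named in the abstract (elicit candidates, verify them, map them to ontology-aligned slot-value pairs) can be illustrated with a small sketch. This is not the authors' implementation: the toy `ONTOLOGY`, the `Constraint` type, the hard-coded candidate list standing in for LLM output, and the two rule-based examiner roles (`grounding_role`, `domain_role`) are all hypothetical placeholders. In the paper the examiner roles are presumably played by an LLM, and the abstract does not say how their verdicts are combined; the sketch assumes a candidate survives only if every role accepts it.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical toy ontology: domain -> slot -> allowed values.
ONTOLOGY = {
    "hotel": {"area": {"north", "south", "centre"}, "parking": {"yes", "no"}},
    "restaurant": {"area": {"north", "south", "centre"}, "food": {"italian", "thai"}},
}

@dataclass(frozen=True)
class Constraint:
    domain: str
    slot: str
    value: str

# An examiner role inspects the dialogue and one candidate, and accepts or rejects it.
Examiner = Callable[[List[str], Constraint], bool]

def grounding_role(dialogue: List[str], c: Constraint) -> bool:
    """Stand-in role 1: the slot or its value must be mentioned somewhere in the dialogue."""
    text = " ".join(dialogue).lower()
    return c.value in text or c.slot in text

def domain_role(dialogue: List[str], c: Constraint) -> bool:
    """Stand-in role 2: the constraint's domain must be mentioned in the dialogue."""
    return c.domain in " ".join(dialogue).lower()

def verify(dialogue: List[str], candidates: List[Constraint],
           examiners: List[Examiner]) -> List[Constraint]:
    """Assumed rule: keep a candidate only if every examiner role accepts it."""
    return [c for c in candidates if all(ex(dialogue, c) for ex in examiners)]

def to_state(constraints: List[Constraint]) -> Dict[str, str]:
    """Map verified constraints to slot-value pairs, dropping anything outside the ontology."""
    state: Dict[str, str] = {}
    for c in constraints:
        allowed = ONTOLOGY.get(c.domain, {}).get(c.slot)
        if allowed is not None and c.value in allowed:
            state[f"{c.domain}-{c.slot}"] = c.value
    return state

dialogue = [
    "I need a hotel in the north with parking.",
    "Also a thai restaurant in the same area.",
]

# Hard-coded stand-in for LLM-elicited candidates; the last one is a hallucination.
candidates = [
    Constraint("hotel", "area", "north"),
    Constraint("hotel", "parking", "yes"),
    Constraint("restaurant", "food", "thai"),
    Constraint("restaurant", "area", "north"),   # implicit: "the same area"
    Constraint("restaurant", "food", "italian"),  # never mentioned in the dialogue
]

verified = verify(dialogue, candidates, [grounding_role, domain_role])
state = to_state(verified)
```

Here `state` is a flat dictionary keyed by `domain-slot`, which is one plausible shape for the "structured, constraint-aware state" that an RL policy could consume. The hallucinated `italian` constraint is rejected by the grounding role before it can reach the state, which is the failure mode the abstract warns about when LLM outputs are passed to RL unverified.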