Bridging Reasoning and Action: Hybrid LLM-RL Framework for Efficient Cross-Domain Task-Oriented Dialogue

📅 2026-04-25

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses cross-domain task-oriented dialogue, which requires reasoning over both explicit and implicit feasibility constraints across long-horizon, multi-turn interactions. Large language models (LLMs) are prone to hallucination in long-range reasoning, and reinforcement learning (RL) struggles to extract structured constraints directly from raw dialogue. The paper proposes VLK-RL, a hybrid framework that first uses an LLM to generate candidate constraints and then applies a dual-role cross-examination procedure to filter out hallucinated or cross-turn-inconsistent outputs. The verified constraints are then mapped into ontology-aligned slot-value representations, which form a structured, constraint-aware state for RL policy optimization. By verifying LLM-generated constraints and converting them into structured form, the approach connects constraint reasoning with policy learning. The authors report improved generalization and robustness across multiple benchmarks, with VLK-RL outperforming strong single-model baselines on long-horizon tasks.

📝 Abstract

Cross-domain task-oriented dialogue requires reasoning over implicit and explicit feasibility constraints while planning long-horizon, multi-turn actions. Large language models (LLMs) can infer such constraints but are unreliable over long horizons, whereas reinforcement learning (RL) optimizes long-horizon behavior yet cannot recover constraints from raw dialogue. Naively coupling LLMs with RL is therefore brittle: unverified or unstructured LLM outputs can corrupt state representations and misguide policy learning. Motivated by this, we propose Verified LLM-Knowledge empowered RL (VLK-RL), a hybrid framework that makes LLM-derived constraint reasoning usable for RL. VLK-RL first elicits candidate constraints with an LLM and then verifies them via a dual-role cross-examination procedure to suppress hallucinations and cross-turn inconsistencies. The verified constraints are mapped into ontology-aligned slot-value representations, yielding a structured, constraint-aware state for RL policy optimization. Experiments across multiple benchmarks demonstrate that VLK-RL significantly improves generalization and robustness, outperforming strong single-model baselines on long-horizon tasks.
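
The three stages named in the abstract (elicit, verify, map to slot-value state) can be illustrated with a minimal Python sketch. Everything here is an assumption for illustration only: the abstract does not specify how the two roles interact, so the sketch simply keeps a candidate when two independent judges both accept it, and the names `Constraint`, `ONTOLOGY`, `cross_examine`, `encode_state`, and the stub judges are hypothetical rather than taken from the paper. In a real system the judges would be LLM calls; here they are plain functions so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass(frozen=True)
class Constraint:
    """A candidate constraint in slot-value form (hypothetical structure)."""
    domain: str
    slot: str
    value: str


# A judge looks at the dialogue and one candidate and returns accept/reject.
Judge = Callable[[str, Constraint], bool]

# Hypothetical toy ontology: domain -> slot -> allowed values.
ONTOLOGY: Dict[str, Dict[str, List[str]]] = {
    "hotel": {"area": ["north", "south", "centre"], "parking": ["yes", "no"]},
    "restaurant": {"area": ["north", "south", "centre"], "food": ["italian", "chinese"]},
}


def cross_examine(dialogue: str, candidates: List[Constraint],
                  defender: Judge, challenger: Judge) -> List[Constraint]:
    """Assumed verification rule: keep a candidate only if both roles accept it."""
    return [c for c in candidates
            if defender(dialogue, c) and challenger(dialogue, c)]


def encode_state(constraints: List[Constraint]) -> List[float]:
    """One-hot encode constraints over the ontology in a fixed order.

    Constraints whose value is not in the ontology are silently dropped,
    so the RL state only ever contains ontology-aligned entries.
    """
    active = {(c.domain, c.slot, c.value) for c in constraints}
    return [1.0 if (d, s, v) in active else 0.0
            for d, slots in ONTOLOGY.items()
            for s, values in slots.items()
            for v in values]


# Stub judges standing in for the two LLM roles (assumptions, not the paper's prompts).
def stub_defender(dialogue: str, c: Constraint) -> bool:
    return True  # the proposing role always stands by its candidates


def stub_challenger(dialogue: str, c: Constraint) -> bool:
    text = dialogue.lower()
    return c.slot in text or c.value in text  # require some textual support


dialogue = "I need a hotel in the north with parking."
candidates = [
    Constraint("hotel", "area", "north"),
    Constraint("hotel", "parking", "maybe"),      # value outside the ontology
    Constraint("restaurant", "food", "italian"),  # unsupported by the dialogue
]
verified = cross_examine(dialogue, candidates, stub_defender, stub_challenger)
state = encode_state(verified)
```

In this toy run the restaurant constraint is rejected during verification because nothing in the dialogue supports it, and the `parking = maybe` constraint passes verification but is dropped at encoding time because `maybe` is not an ontology value. The resulting fixed-length vector is the kind of structured input an RL policy could consume alongside its other state features.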

Problem

Research questions and friction points this paper is trying to address.

task-oriented dialogue

cross-domain

reasoning

feasibility constraints

long-horizon planning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Hybrid LLM-RL

Constraint Verification

Cross-Domain Task-Oriented Dialogue