Improving Generalization in Intent Detection: GRPO with Reward-Based Curriculum Sampling

📅 2025-04-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
In task-oriented dialogue, intent detection struggles to generalize and fails on unseen intents as massive numbers of new tools are continuously integrated. To address this, the authors train models with Group Relative Policy Optimization (GRPO), a reinforcement learning algorithm, combined with chain-of-thought (CoT) reasoning to enable interpretable, stepwise inference over complex semantic intents. They further design Reward-based Curriculum Sampling (RCS), a dynamic sampling strategy that prioritizes hard examples during training to improve convergence speed and generalization robustness. Experiments under zero-shot and few-shot unseen-intent settings show that GRPO-trained models significantly outperform supervised fine-tuning baselines: RCS accelerates training convergence and enhances cross-intent generalization, while CoT improves accuracy on intricate intent recognition. Overall, the approach yields adaptive, robust intent detection with minimal or no labeled data for novel intents.
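The group-relative advantage at the core of GRPO can be sketched briefly: a group of completions is sampled per prompt, each is scored, and every reward is normalized against the group's own mean and standard deviation, avoiding a learned value model. This is a minimal illustrative sketch, not the paper's implementation; the function name and the binary reward scheme are assumptions.

```python
# Minimal sketch of GRPO's group-relative advantage (illustrative, not the
# paper's code). Each completion's reward is normalized against its group.
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward by the group's mean and std.

    GRPO samples several completions per prompt and uses this normalized
    score as the advantage, so no separate critic network is needed.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

# Example: four sampled answers to one intent-detection prompt, rewarded
# 1.0 when the predicted intent is correct, 0.0 otherwise (an assumption).
advs = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
# Correct answers get positive advantages, incorrect ones negative.
```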

📝 Abstract
Intent detection, a critical component in task-oriented dialogue (TOD) systems, faces significant challenges in adapting to the rapid influx of integrable tools with complex interrelationships. Existing approaches, such as zero-shot reformulations and LLM-based dynamic recognition, struggle with performance degradation when encountering unseen intents, leading to erroneous task routing. To enhance the model's generalization on unseen tasks, we employ Reinforcement Learning (RL) combined with Reward-based Curriculum Sampling (RCS) during Group Relative Policy Optimization (GRPO) training for intent detection. Experiments demonstrate that RL-trained models substantially outperform supervised fine-tuning (SFT) baselines in generalization. In addition, introducing RCS significantly bolsters the effectiveness of RL in intent detection by focusing the model on challenging cases during training. Moreover, incorporating Chain-of-Thought (CoT) processes in RL notably improves generalization in complex intent detection tasks, underscoring the importance of explicit reasoning in challenging scenarios. This work advances the generalization of intent detection, offering practical insights for deploying adaptable dialogue systems.
Problem

Research questions and friction points this paper is trying to address.

Enhancing intent detection generalization for unseen tasks
Addressing performance degradation with unseen intents in TOD systems
Improving RL-based training with reward-based curriculum sampling
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning for intent detection
Reward-based Curriculum Sampling in training
Chain-of-Thought processes in RL
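The reward-based curriculum idea above can be sketched as weighted sampling: examples on which the model currently earns low reward (hard cases) are drawn more often for training. The weighting rule below is an assumption for illustration, not the paper's exact strategy; all names are hypothetical.

```python
# Hypothetical sketch of reward-based curriculum sampling: examples with
# low average reward are over-sampled. The linear weighting is an
# illustrative assumption, not the paper's exact rule.
import random

def sample_batch(examples, avg_rewards, batch_size, rng=None):
    """Draw a training batch, weighting harder (low-reward) examples higher."""
    rng = rng or random.Random(0)
    weights = [1.0 - r for r in avg_rewards]  # lower reward -> larger weight
    if sum(weights) <= 0:  # guard: every example already solved
        weights = [1.0] * len(examples)
    return rng.choices(examples, weights=weights, k=batch_size)

examples = ["book_flight", "cancel_order", "rare_intent"]
avg_rewards = [0.9, 0.8, 0.1]  # "rare_intent" is currently hardest
batch = sample_batch(examples, avg_rewards, batch_size=8)
```

As training progresses and rewards on hard examples rise, their sampling weights shrink, so the curriculum adapts dynamically rather than following a fixed easy-to-hard schedule.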
Zihao Feng
Harbin Institute of Technology
Natural Language Processing · Large Language Model
Xiaoxue Wang
Platform and Content Group, Tencent
Ziwei Bai
Beijing University of Posts and Telecommunications
Machine Comprehension
Donghang Su
Platform and Content Group, Tencent
Bowen Wu
Platform and Content Group, Tencent
Qun Yu
Platform and Content Group, Tencent
Baoxun Wang
Platform and Content Group, Tencent
Natural Language Processing · Deep Learning · Chat-Bot