Learning to Clarify: Multi-turn Conversations with Action-Based Contrastive Self-Training

📅 2024-05-31
🏛️ International Conference on Learning Representations
📈 Citations: 11
Influential: 0
🤖 AI Summary
Current large language models (LLMs) lack proactive ambiguity-clarification skills in multi-turn dialogue: rather than asking explicit clarification questions, they often evade them or implicitly guess the user's intent. Moreover, high-quality dialogue samples for training such strategies are scarce. This work proposes Action-Based Contrastive Self-Training (ACT), which formalizes ambiguity identification as a learnable, explicit dialogue action to enable sample-efficient policy optimization. The authors introduce AmbigSQL, a novel benchmark designed to evaluate implicit ambiguity reasoning in SQL generation, and combine quasi-online preference optimization via Direct Preference Optimization (DPO) with multi-turn action modeling and joint grounding across tabular, textual, and SQL modalities. Experiments show that ACT significantly outperforms supervised fine-tuning and standard DPO on tabular QA, machine reading comprehension, and AmbigSQL, achieving substantial gains in clarification accuracy and task completion rate even under very low annotation budgets.

📝 Abstract
Large language models (LLMs) aligned through reinforcement learning from human feedback (RLHF) have quickly become one of the dominant paradigms for building intelligent conversational assistant agents. However, despite their strong performance across many benchmarks, LLM-based agents still lack conversational skills such as disambiguation: when generalized assistants are faced with ambiguity, they often overhedge or implicitly guess users' ground-truth intents rather than asking clarification questions, and under task-specific settings, high-quality conversation samples are often limited, affecting models' ability to learn optimal dialogue action policies. We propose Action-Based Contrastive Self-Training (henceforth ACT), a quasi-online preference optimization algorithm based on Direct Preference Optimization (DPO) which allows for sample-efficient dialogue policy learning in multi-turn conversation. We demonstrate ACT's efficacy under sample-efficient conditions in three difficult conversational tasks: tabular-grounded question-answering, machine reading comprehension, and AmbigSQL, a novel task for disambiguating information-seeking requests for text-to-SQL generation. Additionally, we propose evaluating LLMs' ability to function as conversational agents by examining whether they can implicitly recognize and reason about ambiguity in conversation. ACT demonstrates substantial conversation modeling improvements over standard approaches to supervised fine-tuning and DPO.
Problem

Research questions and friction points this paper is trying to address.

LLMs lack disambiguation skills in conversations
Limited high-quality samples hinder dialogue policy learning
Need for implicit ambiguity recognition in conversational agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Action-Based Contrastive Self-Training for dialogue
Quasi-online preference optimization algorithm
Data-efficient learning without action labels
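
At the core of the quasi-online preference optimization step is the standard DPO objective: the policy is trained so that, relative to a frozen reference model, it assigns higher likelihood to the preferred (e.g. correctly clarifying) response than to the dispreferred one. The sketch below is illustrative only, not the paper's implementation; the function name, argument names, and `beta` default are our own assumptions.

```python
import math

def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO loss for one (chosen, rejected) response pair.

    Each argument is a summed token log-probability of a full response
    under the policy (logp_*) or the frozen reference model (ref_logp_*).
    Loss = -log sigmoid(beta * (log-ratio margin)); illustrative sketch.
    """
    # Implicit rewards: how much more the policy likes each response
    # than the reference model does, scaled by beta.
    chosen_reward = beta * (logp_chosen - ref_logp_chosen)
    rejected_reward = beta * (logp_rejected - ref_logp_rejected)
    margin = chosen_reward - rejected_reward
    # -log(sigmoid(margin)) written in a numerically direct form.
    return math.log(1.0 + math.exp(-margin))

# Toy numbers: the policy favors the chosen response more than the
# reference does, so the loss falls below log(2) (the zero-margin value).
loss = dpo_loss(-10.0, -12.0, -11.0, -11.5)
```

A larger margin between the chosen and rejected implicit rewards drives the loss toward zero, which is what pushes the policy to prefer the clarification action on ambiguous turns.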