ICPO: Illocution-Calibrated Policy Optimization for Multi-Turn Conversation

📅 2026-01-20
📈 Citations: 1
Influential: 0
🤖 AI Summary
This work addresses the "lost-in-conversation" problem in large language models, where ambiguous initial instructions in multi-turn conversations lead to persistent erroneous assumptions that are difficult to correct later. To mitigate this, the paper proposes a reinforcement learning framework that incorporates pragmatic awareness into the reward mechanism. By detecting instruction ambiguity, the framework modulates the reward signal to encourage the model to express epistemic humility or proactively seek clarification under uncertainty. The approach combines verifiable rewards with data augmentation using underspecified prompts, enabling fine-grained control over response style. Experiments show that the method improves average performance on multi-turn dialogue tasks by 75% while remaining robust on single-turn benchmarks, enhancing both the cooperativeness and the robustness of conversational agents.
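The reward modulation described above can be sketched as follows. This is a minimal illustrative reconstruction, not the paper's actual implementation: the function name, reward values, and the assumption that ambiguity is available as a binary label are all hypothetical.

```python
# Hypothetical sketch of illocution-calibrated reward shaping.
# All names and reward magnitudes are illustrative assumptions,
# not taken from the paper.

def icpo_reward(is_ambiguous: bool, answered_directly: bool,
                answer_correct: bool, asked_clarification: bool) -> float:
    """Modulate a verifiable reward by whether the prompt was ambiguous.

    Under ambiguity, asking for clarification is rewarded, while a
    confident direct answer is down-weighted even when it happens to
    be correct; under a clear prompt, the standard verifiable reward
    on correctness applies.
    """
    if is_ambiguous:
        if asked_clarification:
            return 1.0   # cooperative behavior under uncertainty
        if answered_directly:
            # Discourage overconfident guessing on underspecified input.
            return 0.2 if answer_correct else -1.0
        return 0.0
    # Unambiguous prompt: reward correctness directly.
    if answered_directly:
        return 1.0 if answer_correct else -1.0
    return -0.5          # needless clarification hurts single-turn utility
```

The asymmetry is the point: the same behavior (a direct answer) receives a different reward depending on the user's illocutionary intent, which is what standard RLVR reward schemes do not capture.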

📝 Abstract
Large Language Models (LLMs) in multi-turn conversations often suffer from a "lost-in-conversation" phenomenon, where they struggle to recover from early incorrect assumptions, particularly when users provide ambiguous initial instructions. We find that standard post-training techniques like Reinforcement Learning with Verifiable Rewards (RLVR) exacerbate this issue by rewarding confident, direct answers, thereby inducing overconfidence and discouraging the model from seeking clarification. To address this, we propose Illocution-Calibrated Policy Optimization (ICPO), a novel training framework that sensitizes the model to instruction ambiguity. ICPO augments the training corpus with underspecified prompts and conditions the reward signal on the user's illocutionary intent, rewarding the model for expressing uncertainty or asking for clarification when faced with ambiguity. Experiments demonstrate that ICPO fosters appropriate humility, yielding a substantial average improvement of 75% in multi-turn conversation, while preserving robust performance on single-turn benchmarks. Our work presents a practical path toward more robust and collaborative conversational AI that can better navigate the nuances of human interaction.
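The corpus-augmentation step in the abstract (pairing prompts with underspecified variants so the policy sees both) can be sketched as below. The function names, the `removable_details` field, and the `ambiguous` flag are assumptions made for illustration; the paper's actual augmentation procedure is not specified here.

```python
# Illustrative sketch of ambiguity augmentation: derive an underspecified
# variant of each prompt and tag it, so the reward can later be
# conditioned on whether the instruction was ambiguous.
# Field and function names are hypothetical, not the paper's API.

import random

def underspecify(prompt: str, removable_details: list[str]) -> str:
    """Create an ambiguous variant by dropping one key detail."""
    detail = random.choice(removable_details)
    return prompt.replace(detail, "").replace("  ", " ").strip()

def augment_corpus(examples: list[dict]) -> list[dict]:
    """Mix original and underspecified prompts into one training set."""
    out = []
    for ex in examples:
        out.append({**ex, "ambiguous": False})
        if ex.get("removable_details"):
            variant = underspecify(ex["prompt"], ex["removable_details"])
            out.append({**ex, "prompt": variant, "ambiguous": True})
    return out
```

Each underspecified variant keeps a pointer to its fully specified source, so a verifiable reward can still be computed once the model asks for, and receives, the missing detail.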
Problem

Research questions and friction points this paper is trying to address.

lost-in-conversation
instruction ambiguity
multi-turn conversation
overconfidence
clarification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Illocution-Calibrated Policy Optimization
multi-turn conversation
instruction ambiguity
reinforcement learning
conversational AI