Breaking Contextual Inertia: Reinforcement Learning with Single-Turn Anchors for Stable Multi-Turn Interaction

📅 2026-03-05

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the susceptibility of large language models to “contextual inertia” in multi-turn interactions, which hinders their ability to effectively incorporate new information or correct early errors. To mitigate this issue, the authors propose RLSTA, a reinforcement learning approach that leverages the model’s strong single-turn reasoning capability as an internal anchor to generate self-supervised reward signals for multi-turn responses. This mechanism guides the model to perform self-calibrated reasoning grounded in the most recent context, without requiring external verifiers. RLSTA demonstrates robust cross-domain generalization—e.g., from mathematical reasoning to code generation—and significantly outperforms standard fine-tuning and abstention strategies in multi-turn settings, thereby enhancing both reasoning stability and accuracy.

📝 Abstract

While LLMs demonstrate strong reasoning capabilities when provided with full information in a single turn, they exhibit substantial vulnerability in multi-turn interactions. Specifically, when information is revealed incrementally or requires updates, models frequently fail to integrate new constraints, leading to a collapse in performance compared to their single-turn baselines. We term the root cause as Contextual Inertia: a phenomenon where models rigidly adhere to previous reasoning traces. Even when users explicitly provide corrections or new data in later turns, the model ignores them, preferring to maintain consistency with its previous (incorrect) reasoning path. To address this, we introduce Reinforcement Learning with Single-Turn Anchors (RLSTA), a generalizable training approach designed to stabilize multi-turn interaction across diverse scenarios and domains. RLSTA leverages the model's superior single-turn capabilities as stable internal anchors to provide reward signals. By aligning multi-turn responses with these anchors, RLSTA empowers models to break contextual inertia and self-calibrate their reasoning based on the latest information. Experiments show that RLSTA significantly outperforms standard fine-tuning and abstention-based methods. Notably, our method exhibits strong cross-domain generalization (e.g., math to code) and proves effective even without external verifiers, highlighting its potential for general-domain applications.
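The abstract does not spell out how the anchor-based reward is computed, so the following is only a minimal sketch of one possible reading: the model's answer to a consolidated, full-information prompt serves as the anchor, and a multi-turn response earns reward when it agrees with that anchor. The names `anchor_reward`, `normalize`, and `toy_solver`, the turn-joining step, and the exact-match comparison are all assumptions made for illustration and are not taken from the paper.

```python
from typing import Callable, List


def normalize(answer: str) -> str:
    """Canonicalize an answer for comparison: lowercase and collapse whitespace."""
    return " ".join(answer.strip().lower().split())


def anchor_reward(
    turns: List[str],
    multi_turn_answer: str,
    single_turn_solver: Callable[[str], str],
) -> float:
    """Hypothetical reward: 1.0 if the multi-turn answer matches the single-turn anchor."""
    # Assumption: merge everything revealed across turns into one full prompt.
    full_prompt = "\n".join(turns)
    # The anchor is the solver's own answer when it sees all information at once.
    anchor = single_turn_solver(full_prompt)
    # Assumption: agreement is measured by exact match after normalization.
    return 1.0 if normalize(multi_turn_answer) == normalize(anchor) else 0.0


def toy_solver(prompt: str) -> str:
    """Stand-in for a single-turn model call: sums every integer token in the prompt."""
    return str(sum(int(tok) for tok in prompt.split() if tok.isdigit()))


turns = ["add 2 and 3", "also add 10"]
updated = anchor_reward(turns, "15", toy_solver)   # answer that uses the latest turn
inertial = anchor_reward(turns, "5", toy_solver)   # answer stuck on the first turn
```

In this toy setup the answer that ignores the second turn receives no reward, which mirrors the contextual-inertia failure the abstract describes. A real training loop would replace `toy_solver` with the model itself and would likely need a softer agreement measure than exact match.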

Problem

Research questions and friction points this paper is trying to address.

Contextual Inertia

Multi-turn Interaction

Large Language Models

Reasoning Consistency

Information Integration

Innovation

Methods, ideas, or system contributions that make the work stand out.

Contextual Inertia

Reinforcement Learning

Single-Turn Anchors