🤖 AI Summary
This work identifies and addresses the “turn amplification” problem in conversational large language models, wherein models unnecessarily prolong dialogues instead of completing a task, substantially increasing operational costs. The study is the first to discover and characterize a query-agnostic, general activation subspace associated with clarification-seeking behavior. Building on this insight, the authors propose a novel, scalable attack paradigm: leveraging mechanistic interpretability to pinpoint the critical subspace, then inducing compliant yet excessively prolonged dialogues across diverse tasks and prompts via supply-chain fine-tuning or low-level parameter perturbations. This approach bypasses existing defenses and exposes a critical security blind spot in current systems concerning dialogue dynamics.
📝 Abstract
Multi-turn interaction length is a dominant factor in the operational costs of conversational LLMs. In this work, we present a new failure mode in conversational LLMs: turn amplification, in which a model consistently prolongs multi-turn interactions without completing the underlying task. We show that an adversary can systematically exploit clarification-seeking behavior (commonly encouraged in multi-turn conversation settings) to scalably prolong interactions. Moving beyond prompt-level behaviors, we take a mechanistic perspective and identify a query-independent, universal activation subspace associated with clarification-seeking responses. Unlike prior cost-amplification attacks that rely on per-turn prompt optimization, our attack arises from conversational dynamics and persists across prompts and tasks. We show that this mechanism provides a scalable pathway to induce turn amplification: both supply-chain attacks via fine-tuning and runtime attacks through low-level parameter corruptions consistently shift models toward abstract, clarification-seeking behavior across prompts. Across multiple instruction-tuned LLMs and benchmarks, our attack substantially increases turn count while remaining compliant. We also show that existing defenses offer limited protection against this emerging class of failures.
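The abstract does not spell out how the clarification-seeking subspace is extracted. A common way to estimate a behavior-linked direction in activation space is the difference-of-means of hidden states between contrasting response types; the sketch below illustrates that generic technique on synthetic data (the dimensions, sample counts, and the single-direction simplification are all assumptions for illustration, not the paper's method):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64  # hidden dimension (illustrative, not from the paper)

# Hypothetical residual-stream activations collected at one fixed layer:
# one set from clarification-seeking responses, one from direct answers.
# Synthetic data: the "clarify" class is shifted along coordinate 0.
acts_clarify = rng.normal(0.0, 1.0, size=(200, d)) + 2.0 * np.eye(d)[0]
acts_direct = rng.normal(0.0, 1.0, size=(200, d))

# Difference-of-means direction: a simple, query-agnostic estimate of a
# behavior-associated subspace (here reduced to a single unit direction).
direction = acts_clarify.mean(axis=0) - acts_direct.mean(axis=0)
direction /= np.linalg.norm(direction)

# Projecting activations onto this direction separates the two behaviors,
# which is the kind of signal a subspace-targeting attack could exploit.
score_clarify = acts_clarify @ direction
score_direct = acts_direct @ direction
print(score_clarify.mean() > score_direct.mean())  # True
```

In this framing, a supply-chain or parameter-corruption attack would bias the model's activations along such a direction so that clarification-seeking responses dominate regardless of the prompt; the actual extraction and perturbation procedures are those of the paper, not this toy sketch.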