🤖 AI Summary
Large language models can collapse into persistent repetition loops during long-context generation, driven by attention heads that lock onto a narrow suffix of the history. The study identifies a positive feedback loop between KV cache reuse and these repetition cycles: cache policies that rank tokens by attention-based importance assign spuriously high scores to repeated tokens and so amplify the repetition. To break this cycle, the authors propose LoopGuard, a lightweight, plug-in KV cache guard that detects loop onset online and prunes repetitive tail spans under a fixed cache budget. They also introduce LoopBench, a benchmark with explicit loop-inducing conditions and loop-oriented metrics for repetition severity and generation instability. On LoopBench, LoopGuard reduces loop incidence by over 90 percentage points while restoring output diversity and reducing token waste.
📝 Abstract
Through systematic experiments on long-context generation, we observe a damaging failure mode in which decoding can collapse into persistent repetition loops. We find that this degeneration is driven by collapsed attention patterns, where a subset of heads locks onto a narrow suffix of the history, and is further stabilized by inference-time KV cache reuse. Crucially, since many existing KV cache policies rely on attention-based importance, this collapse can produce spuriously high scores for repetitive tokens, causing cache management to inadvertently amplify repetition. To study this phenomenon in a controlled and reproducible manner, we introduce LoopBench, a benchmark with explicit loop-inducing conditions and loop-oriented metrics that quantify repetition severity and generation instability beyond downstream task scores. Building on these insights, we propose LoopGuard, a lightweight, plug-in KV cache guard that detects loop onset online and disrupts the feedback cycle by pruning repetitive tail spans under a fixed cache budget. Experiments on LoopBench show that LoopGuard reduces loop incidence by over 90 percentage points, while restoring output diversity and reducing token waste.
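To make the idea of detecting loop onset and pruning repetitive tail spans concrete, the following is a minimal illustrative sketch, not LoopGuard's actual algorithm, which the abstract does not specify. It assumes a loop shows up as the same token span repeating back-to-back at the end of the generated sequence, and that cache entries can be addressed by token position. The function names `find_tail_loop` and `prune_tail_loop` and the parameters `max_period`, `min_repeats`, and `keep_repeats` are hypothetical.

```python
from typing import List, Optional, Tuple


def find_tail_loop(
    tokens: List[int], max_period: int = 64, min_repeats: int = 3
) -> Optional[Tuple[int, int]]:
    """Return (period, repeats) for the shortest span that repeats
    back-to-back at the tail at least `min_repeats` times, else None.

    Assumption: a loop is an exact token-level repetition at the tail.
    """
    n = len(tokens)
    for period in range(1, max_period + 1):
        if period * min_repeats > n:
            break  # not enough history for this or any longer period
        unit = tokens[n - period:]
        repeats = 1
        # Walk backwards one period at a time while the span keeps matching.
        while (repeats + 1) * period <= n and (
            tokens[n - (repeats + 1) * period : n - repeats * period] == unit
        ):
            repeats += 1
        if repeats >= min_repeats:
            return period, repeats
    return None


def prune_tail_loop(
    cache_positions: List[int],
    tokens: List[int],
    keep_repeats: int = 1,
    max_period: int = 64,
    min_repeats: int = 3,
) -> List[int]:
    """Return the cache positions to keep after dropping redundant tail
    repeats, leaving `keep_repeats` copies of the repeated span.

    Assumption: `cache_positions` holds token indices currently cached.
    """
    hit = find_tail_loop(tokens, max_period, min_repeats)
    if hit is None:
        return list(cache_positions)
    period, repeats = hit
    # Everything from this index onward is a redundant copy of the span.
    drop_start = len(tokens) - (repeats - keep_repeats) * period
    return [p for p in cache_positions if p < drop_start]


if __name__ == "__main__":
    history = [1, 2, 3, 7, 8, 7, 8, 7, 8]
    print(find_tail_loop(history))                  # (2, 3)
    print(prune_tail_loop(list(range(9)), history))  # [0, 1, 2, 3, 4]
```

In this sketch, pruning frees cache slots rather than adding any, so it is compatible with a fixed cache budget; a real guard would also need to decide how generation resumes after the redundant entries are removed, which is outside what this example covers.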