Accordion-Thinking: Self-Regulated Step Summaries for Efficient and Readable LLM Reasoning

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of large language models in long-chain reasoning, where the linear growth of the KV cache and the quadratic complexity of attention hinder both efficiency and readability. The authors propose Accordion-Thinking, a framework that trains models to dynamically compress critical reasoning information into compact summaries. The approach features a dual-mode Fold/Unfold mechanism: Fold mode periodically discards redundant historical tokens to boost throughput, while Unfold mode preserves the full reasoning trace to ensure interpretability. By combining structured reasoning-step generation with a dynamic summarization module trained end to end via reinforcement learning, the method achieves a 3× increase in inference throughput under a 48GB GPU memory constraint without sacrificing accuracy, effectively closing the gap between efficient and human-readable reasoning.

📝 Abstract
Scaling test-time compute via long Chain-of-Thought unlocks remarkable gains in reasoning capabilities, yet it faces practical limits due to the linear growth of the KV cache and quadratic attention complexity. In this paper, we introduce Accordion-Thinking, an end-to-end framework where LLMs learn to self-regulate the granularity of their reasoning steps through dynamic summarization. This mechanism enables a Fold inference mode, in which the model periodically summarizes its thought process and discards former thoughts to reduce dependency on historical tokens. We apply reinforcement learning to incentivize this capability further, uncovering a critical insight: the accuracy gap between the highly efficient Fold mode and the exhaustive Unfold mode progressively narrows and eventually vanishes over the course of training. This phenomenon demonstrates that the model learns to encode essential reasoning information into compact summaries, achieving effective compression of the reasoning context. Our Accordion-Thinker shows that with learned self-compression, LLMs can tackle complex reasoning tasks with minimal dependency-token overhead without compromising solution quality: it achieves 3× throughput while maintaining accuracy on a 48GB GPU memory configuration, and the structured step summaries provide a human-readable account of the reasoning process.
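The Fold inference loop described in the abstract can be sketched as follows. This is a minimal conceptual sketch, not the paper's implementation: `model_step` and `summarize` are hypothetical stand-ins for the model's step generation and its learned summarization module, and the toy instantiation below only illustrates the control flow of carrying a compact summary forward instead of the full history.

```python
def fold_mode_generate(model_step, summarize, problem, max_steps=8):
    """Fold-mode reasoning: only the latest summary is carried forward.

    model_step(context) -> (step_text, done): produce one reasoning step.
    summarize(context, step_text) -> compact context replacing the history.
    """
    context = problem
    trace = []  # full step trace, kept only for human readability (Unfold view)
    for _ in range(max_steps):
        step, done = model_step(context)
        trace.append(step)
        if done:
            return step, trace
        # Fold: discard the full history and keep only a compact summary,
        # so the dependency window (and hence the KV cache) stays bounded.
        context = summarize(context, step)
    return context, trace


# Toy instantiation: sum a list one element at a time, where the "summary"
# is just the running total plus the remaining items.
def toy_step(ctx):
    total, items = ctx
    if not items:
        return ("answer: %d" % total, True)
    return ("add %d" % items[0], False)

def toy_summarize(ctx, step):
    total, items = ctx
    return (total + items[0], items[1:])

answer, trace = fold_mode_generate(toy_step, toy_summarize, (0, [3, 5, 7]))
```

The key property the paper's RL training targets is that `summarize` retains everything needed for the remaining steps, so folding does not change the final answer relative to keeping the full trace.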
Problem

Research questions and friction points this paper is trying to address.

KV cache
attention complexity
reasoning efficiency
token overhead
long Chain-of-Thought
Innovation

Methods, ideas, or system contributions that make the work stand out.

Accordion-Thinking
self-regulated summarization
Fold inference
reasoning compression
efficient LLM reasoning