🤖 AI Summary
This work addresses the challenge of compressing long contexts to accelerate large language model (LLM) inference with minimal information loss, without relying on dedicated compression modules. It shows that a thinking LLM can itself compress long contexts during inference and introduces a paradigm, Thinking as Compression (TaC), that uses the model's own thinking traces as the compressed context, removing the need for an external compressor. The TaC-C (Thinking as Compression Constrained) variant applies a reward-driven optimization framework to make these traces compact and budget-controllable. On four long-context question-answering benchmarks, TaC-C outperforms the strongest baseline by 17.4% and 23.4% in average F1 at 4× and 8× compression ratios, respectively, with corresponding average Exact Match (EM) gains of 15.7% and 21.7%.
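The F1 and EM figures above are standard extractive-QA metrics. As a hedged illustration, the sketch below assumes SQuAD-style answer normalization (lowercasing, stripping punctuation and English articles, collapsing whitespace), which is common in long-context QA evaluation but is not confirmed by the summary. It also shows one plausible way a compression ratio translates into a token budget for the compressed context. The helper names (`normalize_answer`, `exact_match`, `token_f1`, `token_budget`) are illustrative assumptions, not the paper's code.

```python
import re
import string
from collections import Counter

# Assumption: SQuAD-style normalization; the paper's exact evaluation script is not stated.
PUNCT = set(string.punctuation)


def normalize_answer(text: str) -> str:
    """Lowercase, drop punctuation and articles, then collapse whitespace."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize_answer(prediction) == normalize_answer(gold))


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token-level precision and recall after normalization."""
    pred_tokens = normalize_answer(prediction).split()
    gold_tokens = normalize_answer(gold).split()
    # Multiset intersection counts each shared token at most min(count) times.
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


def token_budget(context_tokens: int, ratio: float) -> int:
    """Hypothetical helper: max compressed-context length for a target ratio (e.g. 4x, 8x)."""
    return max(1, int(context_tokens // ratio))
```

For example, an 8,000-token context gets a 2,000-token budget at 4× and a 1,000-token budget at 8×, and `token_f1("in Paris, France", "Paris")` gives partial credit (0.5) where `exact_match` gives none.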
📝 Abstract
Context compression aims to shorten long context inputs with minimal information loss to accelerate LLM inference. While existing methods have shown promise, they typically rely on complex compression modules or compression-specific training, leaving the intrinsic capabilities of LLMs underexplored. In contrast, this work reveals that a thinking model can itself naturally compress long contexts by organizing task-relevant information. We thus derive Thinking as Compression (TaC), a new compression paradigm that treats thinking itself as the compressed context. Without relying on a dedicated compressor, TaC directly prompts the thinking model to generate thinking traces as the shortened context, and this alone already outperforms most representative compression methods. However, raw thinking output can be hard to keep within a length budget and may exhibit shortcut behaviors, so we further introduce Thinking as Compression Constrained (TaC-C), which uses a simple reward-driven optimization framework to elicit intrinsic thinking as compact and controllable compressed context. Experiments across four long-context QA benchmarks demonstrate that TaC-C consistently outperforms existing baselines. At 4x and 8x compression ratios, it surpasses the strongest competitor by 17.4% and 23.4% in average F1, and by 15.7% and 21.7% in average Exact Match (EM), respectively.
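The abstract does not specify TaC-C's reward. A minimal sketch, assuming the reward combines downstream answer quality with a penalty for exceeding the length budget implied by the target compression ratio, might look like the following. `Rollout`, `constrained_reward`, and `overflow_penalty` are hypothetical names, and the shortcut behaviors mentioned in the abstract are not modeled here.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """One sampled thinking trace used as compressed context (hypothetical schema)."""
    trace_tokens: int   # length of the thinking trace in tokens
    answer_f1: float    # F1 of the answer produced from the trace alone


def constrained_reward(rollout: Rollout, context_tokens: int, ratio: float,
                       overflow_penalty: float = 1.0) -> float:
    """Hypothetical reward: answer quality, minus a penalty proportional to budget overflow."""
    # Budget implied by the target compression ratio (e.g. 8x on 8,000 tokens -> 1,000).
    budget = max(1, int(context_tokens // ratio))
    if rollout.trace_tokens <= budget:
        return rollout.answer_f1
    # Relative overflow: 0.5 means the trace is 50% longer than its budget.
    overflow = (rollout.trace_tokens - budget) / budget
    return rollout.answer_f1 - overflow_penalty * overflow
```

Scaling the penalty by relative overflow keeps rewards comparable across contexts of different lengths; a hard cutoff (for example, zero reward for any over-budget trace) would be an equally plausible design under these assumptions.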