🤖 AI Summary
This work addresses the high computational cost of attention mechanisms and key-value caching in long-context reasoning, which motivates efficient soft compression methods. The authors propose ComprExIT, a lightweight framework for soft context compression that operates on the frozen hidden states of large language models. ComprExIT decouples compression from self-attention dynamics through an explicit information transfer mechanism that incorporates both depth-wise and width-wise information propagation: the former mitigates layer-wise representation overwriting, while the latter enables globally coordinated information allocation. Concretely, the approach combines token-anchor-based selective transfer across layers with a globally optimized slot aggregation mechanism. Evaluated on six question-answering benchmarks, ComprExIT significantly outperforms state-of-the-art methods, achieving substantially improved compression efficacy and robustness with only approximately 1% additional parameters.
📝 Abstract
Long-context inference with Large Language Models (LLMs) is costly due to quadratic attention and growing key-value caches, motivating context compression. In this work, we study soft context compression, where a long context is condensed into a small set of continuous representations. Existing methods typically re-purpose the LLM itself as a trainable compressor, relying on layer-by-layer self-attention to iteratively aggregate information. We argue that this paradigm suffers from two structural limitations: (i) progressive representation overwriting across layers, and (ii) uncoordinated allocation of compression capacity across tokens. We propose ComprExIT (Context Compression via Explicit Information Transmission), a lightweight framework that recasts soft compression as a new paradigm: explicit information transmission over frozen LLM hidden states. This decouples compression from the model's internal self-attention dynamics. ComprExIT performs (i) depth-wise transmission, which selectively transmits multi-layer information into token anchors, mitigating progressive overwriting, and (ii) width-wise transmission, which aggregates anchors into a small number of slots via a globally optimized transmission plan, ensuring coordinated allocation of information. Across six question-answering benchmarks, ComprExIT consistently outperforms state-of-the-art context compression methods while introducing only ~1% additional parameters, demonstrating that explicit and coordinated information transmission enables more effective and robust long-context compression.
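To make the two-stage design concrete, here is a minimal NumPy sketch of the data flow the abstract describes: depth-wise transmission mixes frozen per-layer hidden states into a set of token anchors, and width-wise transmission aggregates those anchors into a small number of slots through a normalized, globally computed assignment. This is an illustrative sketch under stated assumptions, not the authors' implementation; the function names, the simple learned layer weights, and the softmax-based transmission plan are all hypothetical stand-ins for the paper's actual mechanisms.

```python
import numpy as np

def depth_wise_transmit(hidden_states, anchor_idx, layer_logits):
    """Mix frozen multi-layer hidden states, then gather token anchors.

    hidden_states: (L, T, d) per-layer hidden states from a frozen LLM
    anchor_idx:    (A,) positions chosen as token anchors (hypothetical)
    layer_logits:  (L,) learned scores for weighting layers (hypothetical)
    """
    w = np.exp(layer_logits - layer_logits.max())
    w /= w.sum()                                   # softmax over layers
    mixed = np.tensordot(w, hidden_states, axes=(0, 0))  # (T, d)
    return mixed[anchor_idx]                       # (A, d) anchor states

def width_wise_transmit(anchors, slot_queries):
    """Aggregate anchors into K slots via a global assignment plan.

    anchors:      (A, d) anchor representations from depth-wise stage
    slot_queries: (K, d) learned slot embeddings (hypothetical)
    """
    d = anchors.shape[1]
    scores = slot_queries @ anchors.T / np.sqrt(d)       # (K, A)
    plan = np.exp(scores - scores.max(axis=1, keepdims=True))
    plan /= plan.sum(axis=1, keepdims=True)              # rows sum to 1
    return plan @ anchors                                # (K, d) slots
```

Because the plan is computed over all anchors at once, each slot's allocation depends on the full set of anchors rather than on a token-by-token attention pass, which is the sense in which allocation is globally coordinated.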