🤖 AI Summary
To address the high computational overhead of long contexts in retrieval-augmented generation (RAG), this paper proposes a lightweight soft context compression method. Instead of a complex, learnable compression module, it uses mean pooling to map the input sequence into a shorter sequence of continuous, dense representations. The compressor can be trained jointly across multiple compression ratios, enabling flexible adaptation to varying context lengths and large language model (LLM) scales, and the pooling step itself introduces no additional parameters, reducing both computational and memory costs. Extensive evaluation on multiple open-domain question answering benchmarks, across diverse LLM sizes, demonstrates its effectiveness: the input sequence is shortened by up to 8× with only a marginal performance drop (below 1.5% on average), strong generalization, and straightforward deployment.
📝 Abstract
A common strategy to reduce the computational costs of using long contexts in retrieval-augmented generation (RAG) with large language models (LLMs) is soft context compression, where the input sequence is transformed into a shorter continuous representation. We develop a lightweight and simple mean-pooling approach that consistently outperforms the widely used compression-tokens architecture, and study training the same compressor to output multiple compression ratios. We conduct extensive experiments across in-domain and out-of-domain QA datasets, as well as across model families, scales, and compression ratios. Overall, our simple mean-pooling approach achieves the strongest performance, with a relatively small drop when training for multiple compression ratios. More broadly though, across architectures and training regimes the trade-offs are more nuanced, illustrating the complex landscape of compression methods.
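To make the core idea concrete, here is a minimal sketch of mean pooling as a sequence compressor: hidden-state vectors are averaged over non-overlapping windows of size equal to the compression ratio, so a sequence of length `n` becomes roughly `n / ratio` soft vectors. This is an illustrative toy in plain Python (the function name and list-of-lists representation are assumptions, not the paper's implementation, which operates on LLM embeddings):

```python
def mean_pool_compress(states, ratio):
    """Compress a sequence of vectors by averaging non-overlapping
    windows of `ratio` consecutive vectors.

    states: list of equal-length vectors (list of lists of floats),
            e.g. per-token hidden states.
    ratio:  compression ratio; a trailing partial window is averaged
            over however many vectors it contains.
    Returns a list of ceil(len(states) / ratio) pooled vectors.
    """
    compressed = []
    for start in range(0, len(states), ratio):
        window = states[start:start + ratio]
        dim = len(window[0])
        # Component-wise mean over the window.
        pooled = [sum(vec[j] for vec in window) / len(window) for j in range(dim)]
        compressed.append(pooled)
    return compressed

# 8 toy "token states" of dimension 2, compressed 4x into 2 soft vectors.
states = [[float(i), float(i)] for i in range(8)]
print(mean_pool_compress(states, 4))  # [[1.5, 1.5], [5.5, 5.5]]
```

Because pooling is a fixed averaging operation, it adds no trainable parameters; in the paper's setting, the same compressor is trained to support several values of `ratio` at once, trading context length against answer quality.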