🤖 AI Summary
To address the high latency and heavy cognitive load that long-context retrieval-augmented generation (RAG) imposes on large language models (LLMs) in multi-hop question answering, this paper proposes BRIEF-Pro, a general-purpose, lightweight abstractive compressor. The model is trained only on short seed contexts (fewer than 1k words) yet compresses ultra-long inputs (over 10k words), demonstrating effective short-to-long context generalization. It combines few-shot distillation with end-to-end multi-hop QA optimization and supports user-controllable summary length. Evaluated on four open-domain multi-hop QA benchmarks with a 70B reader model, BRIEF-Pro at 32× context compression improves average QA accuracy by 4.67% over LongLLMLingua at 9× compression, while incurring only 23% of its computational overhead. The framework substantially improves both inference efficiency and cross-context generalization.
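As a concrete reading of the 32× figure, compression ratio is conventionally the length of the original retrieved context divided by the length of the compressed summary. A minimal sketch of that measurement (the word-level length measure is an assumption for illustration; the paper may count model tokens instead):

```python
def compression_ratio(original: str, summary: str) -> float:
    # Assumption: length measured in whitespace-delimited words;
    # the paper may count model tokens instead.
    return len(original.split()) / max(len(summary.split()), 1)

# e.g. a 9,600-word retrieval compressed to a 300-word summary -> 32.0
```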
📝 Abstract
As retrieval-augmented generation (RAG) tackles complex tasks, increasingly expanded contexts offer richer information, but at the cost of higher latency and increased cognitive load on the model. To mitigate this bottleneck, especially for intricate multi-hop questions, we introduce BRIEF-Pro, a universal, lightweight compressor that distills the evidence relevant to a given query from retrieved documents into a concise summary for seamless integration into in-context RAG. Using seed data consisting of relatively short contexts (fewer than 1k words), BRIEF-Pro is trained to perform abstractive compression of extended contexts exceeding 10k words across a wide range of scenarios. Furthermore, BRIEF-Pro offers flexible user control over summary length by allowing users to specify the desired number of sentences. Experiments on four open-domain multi-hop question-answering datasets show that BRIEF-Pro generates more concise and relevant summaries, enhancing performance across small, large, and proprietary language models. With a 70B reader model, 32× compression by BRIEF-Pro improves QA performance by 4.67% on average over LongLLMLingua's 9× compression, while requiring only 23% of its computational overhead.
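To make the compress-then-read workflow concrete, below is a minimal sketch of how a query-aware compressor like BRIEF-Pro could slot into an in-context RAG pipeline, including the sentence-count control the abstract describes. The prompt templates and the `compress`/`answer` helpers are illustrative assumptions, not BRIEF-Pro's actual interface.

```python
from typing import Callable, List

# Hypothetical interface: `compressor` and `reader` stand in for any
# text-in/text-out LLM callable; these names are assumptions, not the
# paper's API.
LLM = Callable[[str], str]

def compress(compressor: LLM, query: str, docs: List[str], n_sentences: int) -> str:
    """Abstractively compress retrieved documents into a query-focused
    summary whose length is set by a user-specified sentence count."""
    context = "\n\n".join(docs)
    prompt = (
        f"Summarize the evidence needed to answer the question "
        f"in at most {n_sentences} sentences.\n\n"
        f"Question: {query}\n\nDocuments:\n{context}\n\nSummary:"
    )
    return compressor(prompt)

def answer(reader: LLM, query: str, summary: str) -> str:
    """The reader sees only the compact summary, not the full retrieval."""
    prompt = f"Context: {summary}\n\nQuestion: {query}\n\nAnswer:"
    return reader(prompt)

# Usage with any two LLM callables:
#   summary = compress(compressor, question, retrieved_docs, n_sentences=3)
#   prediction = answer(reader, question, summary)
```

The design point worth noting is that the reader model never sees the raw retrieval; only the compressed summary enters its context window, which is where the latency and cognitive-load savings come from.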