🤖 AI Summary
To address context-length sensitivity in retrieval-augmented multi-document summarization, this paper proposes a novel method for dynamically estimating the optimal retrieval context length. Methodologically, it introduces a panel-driven estimation algorithm leveraging silver references generated by multiple large language models (LLMs), thereby eliminating reliance on static-length benchmarks such as RULER or HELMET. The approach integrates long-context LLMs (e.g., those supporting 128K+ tokens), retrieval-augmented generation (RAG), and explicit modeling of context-length sensitivity to ensure generalizability across diverse model architectures and scales. Experimental results demonstrate substantial improvements in ROUGE scores on multi-document summarization tasks, with consistent gains across both small and large models. Moreover, the method exhibits superior robustness in ultra-long-context scenarios compared to existing context-length estimation techniques.
📝 Abstract
Recent advances in the long-context reasoning abilities of language models have led to interesting applications in large-scale multi-document summarization. However, prior work has shown that these long-context models are not effective at their claimed context window sizes. Retrieval-augmented systems offer an efficient and effective alternative, but their performance can be highly sensitive to the choice of retrieval context length. In this work, we present a hybrid method that combines retrieval-augmented systems with the long context windows supported by recent language models. Our method first estimates the optimal retrieval length as a function of the retriever, summarizer, and dataset. On a randomly sampled subset of the dataset, we use a panel of LLMs to generate a pool of silver references, which we then use to estimate the optimal context length for a given RAG system configuration. Our results on the multi-document summarization task showcase the effectiveness of our method across model classes and sizes. We compare against length estimates derived from strong long-context benchmarks such as RULER and HELMET. Our analysis also highlights the effectiveness of our estimation method for very long-context LMs and its generalization to new classes of LMs.
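The estimation procedure described in the abstract can be sketched as a simple search: for each candidate retrieval context length, summarize a sampled subset with the RAG system at that length, score the outputs against the panel-generated silver references, and keep the length with the best average score. The function names, the token-overlap scoring stand-in, and the candidate length grid below are all illustrative assumptions, not the paper's actual API or metric.

```python
def panel_silver_references(document_set, panel):
    """Each panel LLM produces one silver reference summary for the document set.
    (Here `panel` is any list of callables standing in for LLM calls.)"""
    return [llm(document_set) for llm in panel]

def score_against_pool(candidate, references):
    """Stand-in similarity: mean unigram-overlap F1 against the silver pool.
    The paper would use a proper summarization metric instead."""
    def f1(a, b):
        ta, tb = set(a.split()), set(b.split())
        if not ta or not tb:
            return 0.0
        overlap = len(ta & tb)
        p, r = overlap / len(ta), overlap / len(tb)
        return 2 * p * r / (p + r) if (p + r) else 0.0
    return sum(f1(candidate, ref) for ref in references) / len(references)

def estimate_optimal_length(sampled_docs, rag_summarize, panel, candidate_lengths):
    """Pick the retrieval context length whose RAG summaries best match the
    silver-reference pool, averaged over the sampled document sets."""
    best_len, best_score = None, float("-inf")
    for length in candidate_lengths:
        total = 0.0
        for docs in sampled_docs:
            silver = panel_silver_references(docs, panel)
            total += score_against_pool(rag_summarize(docs, length), silver)
        avg = total / len(sampled_docs)
        if avg > best_score:
            best_len, best_score = length, avg
    return best_len
```

In practice `rag_summarize` would retrieve up to `length` tokens of context before summarizing, and the silver references would be cached rather than regenerated per candidate length; the sketch keeps the loop structure explicit for clarity.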