🤖 AI Summary
Existing long-context LLM benchmarks suffer from fixed context lengths, high manual annotation costs, and label leakage during training. To address these limitations, we propose AcademicEval, a live, continuously updated benchmark for long-context generation. AcademicEval automatically constructs academic writing tasks at multiple abstraction levels (Title, Abstract, Introduction, and Related Work) from arXiv papers and leverages a collected co-author graph to select high-quality few-shot demonstrations, removing the need for manual labeling and enabling flexible context lengths. Its live evaluation protocol draws on newly released papers, ensuring no label leakage. Extensive experiments show that state-of-the-art LLMs perform poorly on tasks with hierarchical abstraction levels and struggle with long few-shot demonstrations, confirming the benchmark's rigor and exposing key bottlenecks in long-context modeling. Our analysis offers concrete guidance for future architectural and training improvements.
📝 Abstract
Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context lengths, labor-intensive annotation, and label leakage during LLM training. Therefore, we propose AcademicEval, a live benchmark for evaluating LLMs on long-context generation tasks. AcademicEval adopts papers from arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling. Moreover, AcademicEval integrates high-quality, expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context lengths. Notably, AcademicEval features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on AcademicEval, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities. Code is available at https://github.com/ulab-uiuc/AcademicEval
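To make the co-author-graph idea concrete, here is a minimal sketch of how few-shot demonstrations might be selected by author overlap. This is an illustrative assumption, not the paper's actual pipeline: the `select_demonstrations` function, the paper-record schema (`id`, `authors`), and the overlap-count scoring are all hypothetical.

```python
from collections import Counter

def select_demonstrations(papers, target_id, k=2):
    """Pick the k papers sharing the most authors with the target paper.

    papers: list of dicts with hypothetical "id" and "authors" keys.
    Papers with no shared authors are excluded; ties break by paper id
    so the selection is deterministic.
    """
    by_id = {p["id"]: p for p in papers}
    target_authors = set(by_id[target_id]["authors"])
    scores = Counter()
    for p in papers:
        if p["id"] == target_id:
            continue  # never use the target paper as its own demonstration
        overlap = len(target_authors & set(p["authors"]))
        if overlap:
            scores[p["id"]] = overlap
    # rank by descending shared-author count, then by id for stability
    ranked = sorted(scores, key=lambda pid: (-scores[pid], pid))
    return ranked[:k]
```

In this sketch, papers by the same research group would naturally surface as demonstrations, which is one plausible way a co-author graph yields stylistically relevant few-shot examples; varying `k` would then control the total context length.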