AcademicEval: Live Long-Context LLM Benchmark

📅 2025-10-20
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing long-context LLM benchmarks suffer from fixed-length constraints, high manual annotation costs, and label leakage during training. To address these limitations, we propose AcademicEval, a dynamically evolving evaluation benchmark for long-context reasoning. AcademicEval automatically constructs multi-level academic writing tasks from arXiv papers and leverages co-authorship graphs to select high-quality few-shot examples, enabling fully automated annotation and eliminating label leakage. It also supports flexible context-length adaptation and real-time dataset updates. Extensive experiments show that state-of-the-art LLMs exhibit significant deficiencies on long-context few-shot inference and hierarchical abstraction tasks, confirming the benchmark's rigor and exposing critical bottlenecks in long-context modeling, such as cross-segment coherence degradation and hierarchical information retention. These findings offer concrete guidance for future architectural and training improvements.

📝 Abstract
Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context length, labor-intensive annotation, and the pressing challenge of label leakage during LLM training. Therefore, we propose AcademicEval, a live benchmark for evaluating LLMs over long-context generation tasks. AcademicEval adopts papers on arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling. Moreover, AcademicEval integrates high-quality and expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context length. In particular, AcademicEval features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on AcademicEval, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal insights for enhancing LLMs' long-context modeling capabilities. Code is available at https://github.com/ulab-uiuc/AcademicEval
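The four writing tasks described above can be read as hold-one-section-out generation: one section of a paper serves as the label, and the remaining sections form the long-context input. A minimal Python sketch of this idea, assuming a simple dict representation of a paper (the field names and schema here are illustrative, not the benchmark's actual code):

```python
# Hypothetical sketch of an AcademicEval-style task: hold out one
# section of a paper as the label and use the rest as long context.
# Field names are illustrative assumptions, not the benchmark's schema.

def build_task(paper: dict, target: str) -> dict:
    """Turn a paper into a (context, label) pair for one writing task.

    `target` is the section to generate, e.g. "title", "abstract",
    "introduction", or "related_work".
    """
    label = paper[target]
    # Everything except the held-out section becomes the model's context.
    context = "\n\n".join(
        text for section, text in paper.items() if section != target
    )
    return {"context": context, "label": label}

paper = {
    "title": "A Study of X",
    "abstract": "We study X...",
    "introduction": "X has long been...",
    "related_work": "Prior work on X...",
}
task = build_task(paper, "abstract")
```

Because the label is a section the authors already wrote, no manual annotation is needed, and evaluating only on newly posted papers keeps the label out of any training corpus.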
Problem

Research questions and friction points this paper is trying to address.

Addressing rigid context length limitations in LLM benchmarks
Solving labor-intensive annotation and label leakage issues
Evaluating hierarchical abstraction tasks with flexible context lengths
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses arXiv papers for academic writing tasks
Integrates few-shot demonstrations from co-author graph
Implements live evaluation to prevent label leakage
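The co-author-graph idea above can be sketched as follows: papers written by a target paper's co-authors serve as topically related few-shot demonstrations. This is a minimal sketch using a plain adjacency-dict graph; the data model and selection order are assumptions, not the benchmark's actual implementation.

```python
# Sketch of few-shot demonstration selection from a co-author graph.
# Representation is a hypothetical adjacency dict, not the paper's code:
# coauthors: author -> list of co-authors; papers_by: author -> paper ids.
coauthors = {"alice": ["bob"], "bob": ["alice"]}
papers_by = {"alice": ["p1"], "bob": ["p2", "p3"]}

def select_few_shot(authors: list[str], k: int = 2) -> list[str]:
    """Collect up to k demonstration papers written by co-authors."""
    demos: list[str] = []
    for author in authors:
        for coauthor in coauthors.get(author, []):
            for paper_id in papers_by.get(coauthor, []):
                if paper_id not in demos:
                    demos.append(paper_id)
                if len(demos) >= k:
                    return demos
    return demos

demos = select_few_shot(["alice"], k=2)
```

Varying `k` changes how many demonstration papers are prepended to the prompt, which is one way the benchmark can flexibly scale the context length.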