🤖 AI Summary
Existing long-context LLM benchmarks suffer from fixed context lengths, high manual annotation costs, and label leakage during training. To address these limitations, we propose AcademicEval, a live, continuously updated benchmark for long-context generation. AcademicEval automatically constructs academic writing tasks at multiple abstraction levels (Title, Abstract, Introduction, and Related Work) from arXiv papers and leverages a collected co-author graph to select high-quality few-shot demonstrations, removing the need for manual labeling and enabling flexible context lengths. Its live evaluation protocol draws on newly released papers, ensuring no label leakage. Extensive experiments show that state-of-the-art LLMs perform poorly on tasks with hierarchical abstraction levels and struggle with long few-shot demonstrations, confirming the benchmark's rigor and exposing key bottlenecks in long-context modeling. Our analysis offers concrete guidance for future architectural and training improvements.
📝 Abstract
Large Language Models (LLMs) have recently achieved remarkable performance in long-context understanding. However, current long-context LLM benchmarks are limited by rigid context lengths, labor-intensive annotation, and label leakage during LLM training. Therefore, we propose AcademicEval, a live benchmark for evaluating LLMs on long-context generation tasks. AcademicEval adopts papers from arXiv to introduce several academic writing tasks with long-context inputs, i.e., Title, Abstract, Introduction, and Related Work, which cover a wide range of abstraction levels and require no manual labeling. Moreover, AcademicEval integrates high-quality, expert-curated few-shot demonstrations from a collected co-author graph to enable flexible context lengths. Notably, AcademicEval features an efficient live evaluation, ensuring no label leakage. We conduct a holistic evaluation on AcademicEval, and the results illustrate that LLMs perform poorly on tasks with hierarchical abstraction levels and tend to struggle with long few-shot demonstrations, highlighting the challenge of our benchmark. Through experimental analysis, we also reveal some insights for enhancing LLMs' long-context modeling capabilities. Code is available at https://github.com/ulab-uiuc/AcademicEval
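To make the co-author-graph idea concrete, here is a minimal sketch of how few-shot demonstrations might be selected by author overlap. This is an illustrative assumption, not the paper's actual pipeline: the `select_demonstrations` function, the paper-record schema (`id`, `authors`), and the overlap-count scoring are all hypothetical.

```python
from collections import Counter

def select_demonstrations(papers, target_id, k=2):
    """Pick the k papers sharing the most authors with the target paper.

    papers: list of dicts with hypothetical "id" and "authors" keys.
    Papers with no shared authors are excluded; ties break by paper id
    so the selection is deterministic.
    """
    by_id = {p["id"]: p for p in papers}
    target_authors = set(by_id[target_id]["authors"])
    scores = Counter()
    for p in papers:
        if p["id"] == target_id:
            continue  # never use the target paper as its own demonstration
        overlap = len(target_authors & set(p["authors"]))
        if overlap:
            scores[p["id"]] = overlap
    # rank by descending shared-author count, then by id for stability
    ranked = sorted(scores, key=lambda pid: (-scores[pid], pid))
    return ranked[:k]
```

In this sketch, papers by the same research group would naturally surface as demonstrations, which is one plausible way a co-author graph yields stylistically relevant few-shot examples; varying `k` would then control the total context length.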