LongCodeBench: Evaluating Coding LLMs at 1M Context Windows

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of evaluating large language models (LLMs) on code comprehension and repair under million-token contexts. To this end, the authors introduce LongCodeBench (LCB), a benchmark designed specifically for ultra-long-context evaluation. It comprises two realistic tasks drawn from real-world GitHub issues: LongCodeQA (question answering over full repositories) and LongSWE-Bench (bug fixing that requires reasoning across files). The benchmark's complexity is carefully stratified, enabling evaluation of models across scales, from Qwen2.5 14B Instruct to Google's flagship Gemini model. Experimental results reveal severe performance degradation across all tested models (e.g., Claude 3.5 Sonnet drops from 29% to 3% accuracy; Qwen2.5 falls from 70.2% to 40%), confirming that million-token contexts remain a critical weakness. LongCodeBench establishes a reproducible, extensible evaluation paradigm for future long-context research.

📝 Abstract
Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks -- not only due to the cost of collecting million-context tasks but also in identifying realistic scenarios that require significant contexts. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce LongCodeBench (LCB), a benchmark to test LLM coding abilities in long-context scenarios. Our benchmark tests both the comprehension and repair capabilities of LCLMs in realistic and important settings by drawing from real-world GitHub issues and constructing QA (LongCodeQA) and bug fixing (LongSWE-Bench) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across different scales -- ranging from Qwen2.5 14B Instruct to Google's flagship Gemini model. We find that long-context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet, or from 70.2% to 40% for Qwen2.5.
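The abstract describes building QA tasks by pairing repository-scale context with issue-derived questions under an extreme token budget. A minimal sketch of that construction idea is below; all names, the dictionary-based "repository", and the whitespace token count are illustrative assumptions, not the paper's actual pipeline or tokenizer.

```python
# Hypothetical sketch of a LongCodeQA-style example: concatenate repository
# files into one context, truncate to a token budget, then append an
# issue-derived question. The whitespace "tokenizer" is a crude stand-in.

def approx_tokens(text: str) -> int:
    """Rough token count via whitespace splitting (not a real tokenizer)."""
    return len(text.split())

def build_context(files: dict[str, str], budget: int) -> str:
    """Concatenate files (path header + body) until the token budget is hit."""
    parts, used = [], 0
    for path, body in files.items():
        chunk = f"# File: {path}\n{body}\n"
        cost = approx_tokens(chunk)
        if used + cost > budget:
            break  # stop once the next file would exceed the budget
        parts.append(chunk)
        used += cost
    return "".join(parts)

def make_qa_prompt(files: dict[str, str], question: str, budget: int) -> str:
    """Wrap the truncated repository context around a question."""
    return f"{build_context(files, budget)}\nQuestion: {question}\nAnswer:"

# Toy two-file "repository" standing in for a real GitHub project.
repo = {
    "utils.py": "def add(a, b):\n    return a + b",
    "main.py": "from utils import add\nprint(add(2, 3))",
}
prompt = make_qa_prompt(repo, "Which function does main.py import?", budget=100)
```

Varying `budget` in a sketch like this is how one could probe the degradation the paper reports: the same question becomes harder as more (possibly irrelevant) context surrounds the relevant files.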
Problem

Research questions and friction points this paper is trying to address.

Evaluating coding LLMs at 1M context windows
Constructing realistic long-context benchmarks for code tasks
Assessing model performance drops in extreme context scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

LongCodeBench tests LLMs at 1M context windows
Uses real-world GitHub issues for QA and bug fixing
Evaluates models across different scales and complexities