🤖 AI Summary
This work addresses the challenge of evaluating large language models (LLMs) on code comprehension and repair under million-token contexts. To this end, it introduces LongCodeBench (LCB), a benchmark built specifically for ultra-long-context evaluation. LCB comprises two realistic, GitHub-issue-driven tasks: LongCodeQA (question answering) and LongSWE-Bench (bug fixing), which bring authentic software-engineering scenarios, including cross-file bug fixing, to million-token-scale assessment. The benchmark's complexity is carefully stratified, enabling evaluation of long-context LLMs (LCLMs) across a wide range of scales, from Qwen2.5 14B Instruct to Google's flagship Gemini model. Experiments reveal severe performance degradation on all evaluated models as context grows (e.g., Claude 3.5 Sonnet drops from 29% to 3% accuracy, and Qwen2.5 from 70.2% to 40%), marking million-token contexts as a critical bottleneck. LongCodeBench thus provides a reproducible, extensible evaluation setting for future long-context research.
📝 Abstract
Context lengths for models have grown rapidly, from thousands to millions of tokens in just a few years. The extreme context sizes of modern long-context models have made it difficult to construct realistic long-context benchmarks: not only is collecting million-token tasks costly, but identifying realistic scenarios that genuinely require such long contexts is also hard. We identify code comprehension and repair as a natural testbed and challenge task for long-context models and introduce LongCodeBench (LCB), a benchmark to test LLM coding abilities in long-context scenarios. Our benchmark tests both the comprehension and repair capabilities of long-context language models (LCLMs) in realistic and important settings by drawing from real-world GitHub issues and constructing question-answering (LongCodeQA) and bug-fixing (LongSWE-Bench) tasks. We carefully stratify the complexity of our benchmark, enabling us to evaluate models across different scales, ranging from Qwen2.5 14B Instruct to Google's flagship Gemini model. We find that long context remains a weakness for all models, with performance drops such as from 29% to 3% for Claude 3.5 Sonnet, or from 70.2% to 40% for Qwen2.5.
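The abstract describes stratifying benchmark complexity so that models of different scales can be evaluated at different context sizes. The sketch below illustrates one way such stratification could work: concatenate a repository's files into a single context, estimate its token count, and place the resulting QA item into a size bucket. This is a minimal illustration under stated assumptions, not LongCodeBench's actual pipeline; the bucket boundaries, the character-based token estimate, and all function names are hypothetical.

```python
# Hypothetical sketch of stratifying a long-context QA item by context
# size, in the spirit of LongCodeBench's complexity stratification.
# Bucket boundaries and the crude token estimate are illustrative
# assumptions, not the benchmark's real implementation.

STRATA = [
    (32_000, "32k"),
    (128_000, "128k"),
    (1_000_000, "1M"),
]


def approx_tokens(text: str) -> int:
    # Crude stand-in for a real tokenizer: roughly 4 characters per token.
    return max(1, len(text) // 4)


def assign_stratum(context: str) -> str:
    """Place a concatenated-repository context into a size bucket."""
    n = approx_tokens(context)
    for limit, label in STRATA:
        if n <= limit:
            return label
    return ">1M"


def build_qa_item(files: dict[str, str], question: str, answer: str) -> dict:
    """Join repository files into one context and tag its stratum."""
    context = "\n\n".join(
        f"# File: {path}\n{source}" for path, source in files.items()
    )
    return {
        "context": context,
        "question": question,
        "answer": answer,
        "stratum": assign_stratum(context),
    }


item = build_qa_item(
    {"utils.py": "def add(a, b):\n    return a + b\n"},
    "What does add() return?",
    "The sum of a and b.",
)
print(item["stratum"])  # a toy repo lands in the smallest bucket: 32k
```

Grouping items this way lets a 32k-context model be scored on the strata it can actually fit, while frontier models are additionally tested on the million-token strata where the paper reports the sharpest degradation.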