🤖 AI Summary
Problem: Existing code benchmarks inadequately evaluate large language models on long-context software development tasks, such as cross-file reasoning and maintaining architectural consistency.
Method: We introduce LoCoBench, the first comprehensive software engineering benchmark explicitly designed for long-context LLMs. It spans 10 programming languages and 8 novel task categories, including code comprehension and cross-file refactoring, and supports context lengths from 10K to 1M tokens. We propose the LoCoBench Score (LCBS), which combines 17 metrics (8 of them new) across 4 evaluation dimensions, and generate 8,000 high-quality test scenarios via a five-stage automated pipeline integrating cross-file dependency analysis and multi-turn development simulation.
Contribution/Results: Empirical evaluation reveals substantial performance gaps in state-of-the-art long-context models on realistic software tasks, uncovering critical challenges (context fragmentation, dependency tracking, and architectural coherence) and establishing a rigorous foundation for future research.
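The 10K-to-1M-token spread (a 100x variation) is what makes this degradation measurable. As a minimal illustrative sketch only, here is how per-scenario scores could be bucketed by context length to expose the drop; the record format, bucket edges, and score values below are hypothetical, not LoCoBench's actual output:

```python
from collections import defaultdict
from statistics import mean

# Hypothetical result records; LoCoBench's real output format may differ.
# Each record holds a scenario's context length (in tokens) and its score.
results = [
    {"context_tokens": 10_000, "score": 0.82},
    {"context_tokens": 100_000, "score": 0.61},
    {"context_tokens": 1_000_000, "score": 0.34},
    # ... one record per evaluated scenario
]

def degradation_by_bucket(results, buckets=(10_000, 100_000, 1_000_000)):
    """Group scenario scores into context-length buckets and average them,
    exposing how performance falls as the context grows 100x."""
    grouped = defaultdict(list)
    for r in results:
        # Assign each scenario to the smallest bucket that fits it.
        bucket = next(b for b in buckets if r["context_tokens"] <= b)
        grouped[bucket].append(r["score"])
    return {b: mean(scores) for b, scores in sorted(grouped.items())}

print(degradation_by_bucket(results))
# e.g. {10000: 0.82, 100000: 0.61, 1000000: 0.34}
```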
📝 Abstract
The emergence of long-context language models with context windows extending to millions of tokens has created new opportunities for sophisticated code understanding and software development evaluation. We propose LoCoBench, a comprehensive benchmark specifically designed to evaluate long-context LLMs in realistic, complex software development scenarios. Unlike existing code evaluation benchmarks that focus on single-function completion or short-context tasks, LoCoBench addresses the critical evaluation gap for long-context capabilities that require understanding entire codebases, reasoning across multiple files, and maintaining architectural consistency across large-scale software systems. Our benchmark provides 8,000 evaluation scenarios systematically generated across 10 programming languages, with context lengths spanning 10K to 1M tokens, a 100x variation that enables precise assessment of long-context performance degradation in realistic software development settings. LoCoBench introduces 8 task categories that capture essential long-context capabilities: architectural understanding, cross-file refactoring, multi-session development, bug investigation, feature implementation, code comprehension, integration testing, and security analysis. Through a 5-phase pipeline, we create diverse, high-quality scenarios that challenge LLMs to reason about complex codebases at unprecedented scale. We introduce a comprehensive evaluation framework with 17 metrics across 4 dimensions, including 8 new evaluation metrics, combined into a single LoCoBench Score (LCBS). Our evaluation of state-of-the-art long-context models reveals substantial performance gaps, demonstrating that long-context understanding in complex software development remains a significant unsolved challenge that demands more attention. LoCoBench is released at: https://github.com/SalesforceAIResearch/LoCoBench.
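The abstract does not spell out how the 17 metrics are folded into LCBS. A minimal sketch, assuming a weighted mean over four per-dimension averages; every dimension name, metric value, and weight below is a hypothetical placeholder, not the paper's actual definition:

```python
from statistics import mean

# Hypothetical per-dimension metric scores in [0, 1]; the real dimension
# names, the 17 metrics, and their weights are defined in the paper.
dimension_metrics = {
    "software_engineering": [0.71, 0.64, 0.58, 0.66],
    "functional_correctness": [0.80, 0.75, 0.77, 0.72, 0.69],
    "code_quality": [0.55, 0.60, 0.52, 0.57],
    "long_context_utilization": [0.41, 0.38, 0.45, 0.36],
}  # 4 + 5 + 4 + 4 = 17 metrics across 4 dimensions
dimension_weights = {
    "software_engineering": 0.25,
    "functional_correctness": 0.25,
    "code_quality": 0.25,
    "long_context_utilization": 0.25,
}

def locobench_score(metrics, weights):
    """Average the metrics inside each dimension, then take a weighted
    mean across dimensions to obtain a single composite score."""
    per_dim = {d: mean(vals) for d, vals in metrics.items()}
    total = sum(weights.values())
    return sum(weights[d] * per_dim[d] for d in per_dim) / total

print(round(locobench_score(dimension_metrics, dimension_weights), 3))  # 0.588
```

With equal weights this reduces to a plain mean of the dimension averages; the paper's actual weighting may differ.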