Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM code-reasoning benchmarks focus on simplistic programs, neglecting realistic complexities—including cross-/intra-function dependencies, API invocations, deep nesting, and non-primitive composite types—leading to severe overestimation of model generalization. Method: We propose RE2-Bench, a rigorous evaluation suite comprising 1,101 problems, including 195 drawn directly from real-world projects. It introduces two key innovations: (1) automatic difficulty stratification using nine interpretable, static code-complexity dimensions; and (2) a hybrid static-dynamic analysis framework for serializing composite and user-defined types, overcoming limitations of primitive-only representations. Contribution/Results: Evaluating six state-of-the-art LLMs reveals a dramatic 42.15–51.50 percentage-point accuracy drop on hard problems versus easy ones, exposing substantial optimistic bias in current benchmarking practices and underscoring the necessity of realism-aligned evaluation.
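The serialization framework described above can be illustrated with a minimal sketch. The paper's actual implementation combines static and dynamic analysis; the recursive walker below only shows the core idea of turning composite and user-defined values into a primitive-friendly form. All names here (`serialize`, the `Node` class, the `__class__`/`attrs` tagging scheme) are illustrative assumptions, not the paper's API.

```python
import json

def serialize(value):
    """Recursively convert primitives, containers, and user-defined objects
    into a JSON-compatible structure, tagging custom classes by name.
    Illustrative sketch only; not the paper's actual serializer."""
    if isinstance(value, (int, float, str, bool)) or value is None:
        return value
    if isinstance(value, (list, tuple)):
        return [serialize(v) for v in value]
    if isinstance(value, dict):
        return {str(k): serialize(v) for k, v in value.items()}
    # User-defined object: record its class name and serialized attributes,
    # so non-primitive state can round-trip through a textual representation.
    return {"__class__": type(value).__name__,
            "attrs": {k: serialize(v) for k, v in vars(value).items()}}

# Hypothetical composite type, e.g. a linked-list node from a real project.
class Node:
    def __init__(self, val, nxt=None):
        self.val, self.nxt = val, nxt

print(json.dumps(serialize(Node(1, Node(2)))))
```

A real system would also need a matching deserializer and handling for cycles and non-`__dict__` objects, which this sketch omits.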

📝 Abstract
Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Existing benchmarks involve simple programs, failing to represent real-world complexities such as inter- or intra-procedural dependencies, core or third-party API calls, highly nested constructs, and non-primitive complex types. Evaluating LLMs under such a simplistic setting poses a significant threat to assumptions about their generalizability in practice. To enable a more realistic evaluation of code reasoning, this paper proposes RE2-Bench, a benchmark of 1,101 reasoning problems, including 195 drawn from mature real-world projects. RE2-Bench leverages static and dynamic program analysis to automatically serialize and deserialize compound, complex, and custom types in real-world code, going far beyond the primitive-only settings used in prior work. A key feature of RE2-Bench is categorizing each reasoning problem as Easy or Hard via a principled majority-vote mechanism over nine interpretable code complexity metrics, resulting in two well-separated and semantically meaningful difficulty categories suitable for precise calibration of LLM reasoning ability. A comprehensive evaluation of six general-purpose and reasoning-oriented LLMs on two widely used code reasoning tasks -- input prediction and output prediction -- using RE2-Bench reveals a significant performance drop from Easy to Hard problems (51.50% for input prediction and 42.15% for output prediction), confirming that prior evaluations substantially overestimate the reasoning capabilities of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluates code reasoning in real-world settings with complex dependencies
Addresses limitations of simple benchmarks lacking realistic program complexities
Measures performance gap between easy and hard code reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically serializes complex real-world code types
Categorizes problems by difficulty using interpretable metrics
Uses static and dynamic program analysis for realistic evaluation
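The difficulty categorization above can be sketched as a majority vote: each of nine static complexity metrics casts a "Hard" vote when it exceeds a threshold, and the majority label wins. The metric names and threshold values below are illustrative placeholders, not the paper's actual nine dimensions.

```python
# Hypothetical thresholds for nine complexity metrics (illustrative values).
METRIC_THRESHOLDS = {
    "cyclomatic_complexity": 10,
    "max_nesting_depth": 3,
    "num_function_calls": 5,
    "num_api_calls": 2,
    "lines_of_code": 50,
    "num_parameters": 4,
    "num_branches": 8,
    "num_loops": 3,
    "num_composite_types": 1,
}

def classify_difficulty(metrics: dict) -> str:
    """Label a problem Easy or Hard by majority vote over metric thresholds."""
    hard_votes = sum(
        1 for name, threshold in METRIC_THRESHOLDS.items()
        if metrics.get(name, 0) > threshold
    )
    return "Hard" if hard_votes > len(METRIC_THRESHOLDS) / 2 else "Easy"
```

Because every metric is an interpretable static property of the code, the resulting Easy/Hard split can be inspected and audited, unlike opaque learned difficulty scores.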