Evaluating Code Reasoning Abilities of Large Language Models Under Real-World Settings

📅 2025-12-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM code-reasoning benchmarks focus on simplistic programs, neglecting realistic complexities—including cross-/intra-function dependencies, API invocations, deep nesting, and non-primitive composite types—leading to severe overestimation of model generalization. Method: We propose RE2-Bench, a rigorous evaluation suite comprising 1,101 problems, including 195 drawn directly from real-world projects. It introduces two key innovations: (1) automatic difficulty stratification using nine interpretable, static code-complexity dimensions; and (2) a hybrid static-dynamic analysis framework for serializing composite and user-defined types, overcoming limitations of primitive-only representations. Contribution/Results: Evaluating six state-of-the-art LLMs reveals a dramatic 42.15–51.50 percentage-point accuracy drop on hard problems versus easy ones, exposing substantial optimistic bias in current benchmarking practices and underscoring the necessity of realism-aligned evaluation.
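The serialization framework described above can be illustrated with a minimal sketch. The paper's actual implementation combines static and dynamic analysis; the recursive walker below only shows the core idea of turning composite and user-defined values into a primitive-friendly form. All names here (`serialize`, the `Node` class, the `__class__`/`attrs` tagging scheme) are illustrative assumptions, not the paper's API.

```python
import json

def serialize(value):
    """Recursively convert primitives, containers, and user-defined objects
    into a JSON-compatible structure, tagging custom classes by name.
    Illustrative sketch only; not the paper's actual serializer."""
    if isinstance(value, (int, float, str, bool)) or value is None:
        return value
    if isinstance(value, (list, tuple)):
        return [serialize(v) for v in value]
    if isinstance(value, dict):
        return {str(k): serialize(v) for k, v in value.items()}
    # User-defined object: record its class name and serialized attributes,
    # so non-primitive state can round-trip through a textual representation.
    return {"__class__": type(value).__name__,
            "attrs": {k: serialize(v) for k, v in vars(value).items()}}

# Hypothetical composite type, e.g. a linked-list node from a real project.
class Node:
    def __init__(self, val, nxt=None):
        self.val, self.nxt = val, nxt

print(json.dumps(serialize(Node(1, Node(2)))))
```

A real system would also need a matching deserializer and handling for cycles and non-`__dict__` objects, which this sketch omits.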

📝 Abstract
Code reasoning tasks are becoming prevalent in large language model (LLM) assessments. Existing benchmarks involve simple programs, failing to represent real-world complexities such as inter- or intra-procedural dependencies, core or third-party API calls, highly nested constructs, and non-primitive complex types. Evaluating LLMs under such a simplistic setting poses a significant threat to assumptions about their generalizability in practice. To enable a more realistic evaluation of code reasoning, this paper proposes RE2-Bench, a benchmark of 1,101 reasoning problems, including 195 drawn from mature real-world projects. RE2-Bench leverages static and dynamic program analysis to automatically serialize and deserialize compound, complex, and custom types in real-world code, going far beyond the primitive-only settings used in prior work. A key feature of RE2-Bench is categorizing each reasoning problem as Easy or Hard via a principled majority-vote mechanism over nine interpretable code complexity metrics, resulting in two well-separated and semantically meaningful difficulty categories suitable for precise calibration of LLM reasoning ability. A comprehensive evaluation of six general-purpose and reasoning-oriented LLMs on two widely used code reasoning tasks -- input prediction and output prediction -- using RE2-Bench reveals a significant performance drop from Easy to Hard problems (51.50% for input prediction and 42.15% for output prediction), confirming that prior evaluations substantially overestimate the reasoning capabilities of LLMs.
Problem

Research questions and friction points this paper is trying to address.

Evaluates code reasoning in real-world settings with complex dependencies
Addresses limitations of simple benchmarks lacking realistic program complexities
Measures performance gap between easy and hard code reasoning tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Automatically serializes complex real-world code types
Categorizes problems by difficulty using interpretable metrics
Uses static and dynamic program analysis for realistic evaluation
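The difficulty categorization above can be sketched as a majority vote: each of nine static complexity metrics casts a "Hard" vote when it exceeds a threshold, and the majority label wins. The metric names and threshold values below are illustrative placeholders, not the paper's actual nine dimensions.

```python
# Hypothetical thresholds for nine complexity metrics (illustrative values).
METRIC_THRESHOLDS = {
    "cyclomatic_complexity": 10,
    "max_nesting_depth": 3,
    "num_function_calls": 5,
    "num_api_calls": 2,
    "lines_of_code": 50,
    "num_parameters": 4,
    "num_branches": 8,
    "num_loops": 3,
    "num_composite_types": 1,
}

def classify_difficulty(metrics: dict) -> str:
    """Label a problem Easy or Hard by majority vote over metric thresholds."""
    hard_votes = sum(
        1 for name, threshold in METRIC_THRESHOLDS.items()
        if metrics.get(name, 0) > threshold
    )
    return "Hard" if hard_votes > len(METRIC_THRESHOLDS) / 2 else "Easy"
```

Because every metric is an interpretable static property of the code, the resulting Easy/Hard split can be inspected and audited, unlike opaque learned difficulty scores.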