Evaluating LLMs Code Reasoning Under Real-World Context

📅 2026-04-14
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing code reasoning benchmarks rely mostly on simplified or synthetically generated code, which fails to capture the complex data structures and contextual dependencies of real-world software projects and so limits how well they measure large language models' practical generalization. This work introduces R2Eval, a benchmark of 135 code reasoning tasks drawn from ten widely used open-source Python projects. By serializing composite and user-defined types and pairing this with a context-aware extraction and evaluation framework, R2Eval preserves real-world data complexity and reconstructs realistic development scenarios. The result is a more valid assessment of code comprehension and reasoning in large language models than traditional benchmarks, which are confined to primitive input-output types.

📝 Abstract
Code reasoning tasks are increasingly crucial to evaluating large language models (LLMs). Yet most existing benchmarks rely on simplistic, LLM-generated snippets or human-written solutions to code challenges and often restrict inputs and outputs to primitive types, failing to reflect the structure and dependencies of real-world projects. These simplifications limit their ability to measure practical generalizability. We present R2Eval, a benchmark of 135 code reasoning problems drawn from ten widely used Python projects. Unlike prior work, R2Eval serializes compound and custom types, preserving real-world data complexity and enabling a more realistic assessment of LLMs.
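
To make the idea of serializing compound and custom types concrete, here is a minimal Python sketch of how a user-defined object might be turned into a JSON-friendly record for an output-prediction task. It is an illustration under stated assumptions, not R2Eval's actual format: the `Version` class, the `bump_minor` function, the `serialize` helper, and the `__type__`/`fields` layout are all hypothetical.

```python
import json
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class Version:
    """Hypothetical user-defined type standing in for a project class."""
    major: int
    minor: int
    tags: list = field(default_factory=list)


def serialize(value):
    """Recursively convert a value to JSON-friendly data, keeping type names.

    Assumed layout (not taken from the paper): dataclass instances become
    {"__type__": ..., "fields": {...}}, tuples are tagged so they are not
    confused with lists, and unknown objects fall back to their repr.
    """
    if is_dataclass(value) and not isinstance(value, type):
        return {
            "__type__": type(value).__name__,
            "fields": {f.name: serialize(getattr(value, f.name)) for f in fields(value)},
        }
    if isinstance(value, dict):
        return {str(k): serialize(v) for k, v in value.items()}
    if isinstance(value, list):
        return [serialize(v) for v in value]
    if isinstance(value, tuple):
        return {"__type__": "tuple", "items": [serialize(v) for v in value]}
    if isinstance(value, (str, int, float, bool)) or value is None:
        return value
    return {"__type__": type(value).__name__, "repr": repr(value)}


def bump_minor(v: Version) -> Version:
    """Hypothetical function under test: returns a new Version."""
    return Version(v.major, v.minor + 1, v.tags + ["bumped"])


# A reasoning task pairs a serialized input with the serialized expected output;
# a model would be shown the code and input and asked to predict the output.
task_input = Version(1, 2, ["stable"])
task = {
    "function": "bump_minor",
    "input": serialize(task_input),
    "expected_output": serialize(bump_minor(task_input)),
}

print(json.dumps(task, indent=2))
```

Keeping the type name next to the field values is one way to let a task carry a custom object rather than only primitives; a real benchmark would also need to handle nested project classes, cycles, and non-deterministic state, which this sketch omits.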
Problem

Research questions and friction points this paper is trying to address.

code reasoning
large language models
real-world context
benchmark
generalizability
Innovation

Methods, ideas, or system contributions that make the work stand out.

code reasoning
real-world context
compound types
LLM evaluation
R2Eval