Evaluating LLMs Code Reasoning Under Real-World Context

📅 2026-04-14

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing code reasoning benchmarks rely mostly on simplified or synthetically generated code, which fails to capture the intricate data structures and contextual dependencies of real-world software projects and so limits how well they measure the practical generalization of large language models. To address this gap, this work introduces R2Eval, a benchmark of 135 code reasoning tasks curated from ten widely used open-source Python projects. By serializing composite and user-defined types, R2Eval preserves the complexity of real-world data, and its context-aware extraction and evaluation framework reconstructs realistic development scenarios. The benchmark improves the validity of assessing code comprehension and reasoning in large language models, overcoming the limitation of traditional benchmarks that restrict inputs and outputs to primitive types.

📝 Abstract

Code reasoning tasks are increasingly crucial to evaluating large language models (LLMs). Yet most existing benchmarks rely on simplistic, LLM-generated snippets or human-written solutions to code challenges and often restrict inputs and outputs to primitive types, failing to reflect the structure and dependencies of real-world projects. These simplifications limit their ability to measure practical generalizability. We present R2Eval, a benchmark of 135 code reasoning problems drawn from ten widely used Python projects. Unlike prior work, R2Eval serializes compound and custom types, preserving real-world data complexity and enabling a more realistic assessment of LLMs.
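
To make the idea of serializing compound and custom types concrete, here is a minimal Python sketch of how a code reasoning task with non-primitive inputs and outputs might be recorded. It is an illustration only, not R2Eval's actual format: the `Point` and `Polyline` classes, the `to_serializable` helper, the `__type__` tag, and the task dictionary layout are all assumptions made for this example.

```python
import json
from dataclasses import dataclass, field, fields, is_dataclass


@dataclass
class Point:  # hypothetical user-defined type
    x: int
    y: int


@dataclass
class Polyline:  # hypothetical compound type that holds custom objects
    name: str
    points: list = field(default_factory=list)


def to_serializable(value):
    """Recursively convert a value to JSON-friendly data, tagging custom types."""
    if is_dataclass(value) and not isinstance(value, type):
        return {
            "__type__": type(value).__name__,  # assumed tag, not R2Eval's schema
            "fields": {
                f.name: to_serializable(getattr(value, f.name))
                for f in fields(value)
            },
        }
    if isinstance(value, dict):
        return {str(k): to_serializable(v) for k, v in value.items()}
    if isinstance(value, (list, tuple)):
        return [to_serializable(v) for v in value]
    return value  # primitives pass through unchanged


def bounding_box(line):
    """Function under test: returns the (min corner, max corner) of a polyline."""
    xs = [p.x for p in line.points]
    ys = [p.y for p in line.points]
    return Point(min(xs), min(ys)), Point(max(xs), max(ys))


line = Polyline("route", [Point(1, 5), Point(4, 2), Point(3, 7)])

# A hypothetical task record: the model would be shown the function and the
# serialized input, then asked to predict the serialized output.
task = {
    "function": "bounding_box",
    "input": to_serializable(line),
    "expected_output": to_serializable(bounding_box(line)),
}

print(json.dumps(task, indent=2))
```

Tagging each object with its type name keeps the structure of the original data visible in the serialized form, which a benchmark limited to primitive inputs and outputs would have to discard.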

Problem

Research questions and friction points this paper is trying to address.

code reasoning

large language models

real-world context

benchmark

generalizability

Innovation

Methods, ideas, or system contributions that make the work stand out.

code reasoning

real-world context

compound types