STEPWISE-CODEX-Bench: Evaluating Complex Multi-Function Comprehension and Fine-Grained Execution Reasoning

📅 2025-08-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing code evaluation benchmarks (e.g., HumanEval, CRUXEVAL) emphasize single-function correctness or low-complexity reasoning and fail to differentiate state-of-the-art large language models. Method: We propose SX-Bench, the first benchmark designed for multi-function collaborative understanding and fine-grained execution reasoning. It introduces (1) multi-subfunction collaboration tasks with explicit control-flow modeling; (2) “computation steps” as atomic execution units for assessing deep comprehension of dynamic execution; and (3) an automated evaluation-generation pipeline integrating program synthesis, symbolic execution, and LLM-assisted verification. Contribution/Results: In an evaluation of 20+ mainstream models, SX-Bench reveals substantial limitations in complex reasoning: OpenAI-o3 achieves only 78.37% accuracy on Hard-Reasoning tasks, significantly below its performance on conventional benchmarks, demonstrating a critical gap in current models’ ability to reason over intricate, interdependent program behaviors.
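
As a purely hypothetical illustration of the task style described above (not drawn from the benchmark itself), an SX-Bench-like reasoning item could pair a short multi-function program with a question about its dynamic behavior, such as the total number of computation steps executed by a call, rather than only its output:

```python
# Hypothetical SX-Bench-style item: several sub-functions cooperate via
# chained calls and a nested loop; the question targets dynamic behavior
# ("how many computation steps does process(4) execute?"),
# not just the final return value.

def scale(x):
    return x * 3              # single arithmetic operation per call

def accumulate(n):
    total = 0
    for i in range(n):        # outer loop
        for j in range(i):    # nested inner loop
            total += scale(j) # chained call into scale()
    return total

def process(n):
    return accumulate(n) + scale(n)

print(process(4))  # I/O matching alone would only check this value (24)
```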

📝 Abstract
In recent years, large language models (LLMs) have made significant progress in code intelligence, yet systematically evaluating their code understanding and reasoning abilities remains challenging. Mainstream benchmarks such as HumanEval and MBPP primarily assess functional correctness, while reasoning benchmarks like CRUXEVAL are limited to single-function, low-complexity scenarios. As a result, advanced models achieve nearly saturated scores, limiting their discriminative power. To address this, we present STEPWISE-CODEX-Bench (SX-Bench), a novel benchmark designed for complex multi-function understanding and fine-grained execution reasoning. SX-Bench features tasks involving collaboration among multiple sub-functions (e.g., chained calls, nested loops), shifting evaluation towards overall control- and data-flow modeling. It defines "computation steps" as the minimal execution unit and requires models to predict the total number of steps in reasoning tasks, thereby assessing a model's in-depth understanding of dynamic execution beyond simple I/O matching. Evaluation on over 20 mainstream models (including 14 reasoning-enhanced models) demonstrates that SX-Bench is highly discriminative: even the state-of-the-art OpenAI-o3 achieves only 78.37% accuracy on Hard-Reasoning tasks, much lower than its saturated scores on previous benchmarks, thereby revealing bottlenecks in complex and fine-grained reasoning. We also release an automated pipeline combining program synthesis, symbolic execution, and LLM-aided validation for efficient benchmark generation and quality assurance. SX-Bench advances code evaluation from "single-function verification" to "multi-function dynamic reasoning," providing a key tool for the in-depth assessment of advanced code intelligence models.
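
The abstract does not spell out exactly how a "computation step" is delimited. One plausible way to derive a reference step count mechanically is to trace execution and count line events; in the minimal sketch below, counting one step per traced line is an assumption made for illustration, not the paper's definition:

```python
import sys

def count_steps(fn, *args):
    """Run fn(*args) under a trace and count executed line events.

    Assumption: one "step" per traced line; SX-Bench's own notion of a
    computation step may be defined differently.
    """
    steps = 0

    def tracer(frame, event, arg):
        nonlocal steps
        if event == "line":
            steps += 1
        return tracer  # keep tracing inside nested and chained calls

    sys.settrace(tracer)
    try:
        result = fn(*args)
    finally:
        sys.settrace(None)
    return result, steps

# Example: value, steps = count_steps(process, 4) for the snippet above.
```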
Problem

Research questions and friction points this paper is trying to address.

Evaluating complex multi-function code understanding in LLMs
Assessing fine-grained execution reasoning beyond I/O matching
Addressing limitations of current benchmarks in discriminative power
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multi-function understanding and fine-grained execution reasoning
Defines "computation steps" as the minimal execution unit
Automated pipeline for benchmark generation and validation (a validation sketch follows below)
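
The released pipeline is described only at a high level (program synthesis, symbolic execution, LLM-aided validation). A rough, assumed sketch of the validation stage could compare the execution-derived ground truth against an answer produced independently by a model; `validate_item` and its `ask_llm` argument below are hypothetical names, not the authors' API:

```python
def validate_item(item, ask_llm):
    """LLM-aided validation sketch (assumed design, not the authors' code).

    item: dict with 'source', 'input', and 'expected_steps', where
    'expected_steps' is the reference count derived by execution or
    symbolic analysis.
    ask_llm: caller-supplied function mapping a prompt string to a reply
    string (hypothetical; plug in any model client).
    """
    prompt = (
        "How many computation steps does this program execute on input "
        f"{item['input']}? Answer with a single integer.\n\n{item['source']}"
    )
    reply = ask_llm(prompt).strip()
    agrees = reply == str(item["expected_steps"])
    # Disagreements are flagged for human review rather than silently
    # dropped, since either the generator or the verifier could be wrong.
    return {**item, "llm_agrees": agrees, "needs_review": not agrees}
```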