The Path Not Taken: Duality in Reasoning about Program Execution

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing program-understanding benchmarks are largely confined to unidirectional input-to-behavior prediction. This limits how fully they can evaluate large language models’ dynamic and causal comprehension of program execution, and it leaves them susceptible to data contamination. To address this, the work proposes a dual-reasoning evaluation paradigm that jointly assesses a model’s causal modeling of execution flow through forward reasoning (predicting behavior from a given input) and backward reasoning (inferring the input modifications required to achieve a target behavior). Based on this framework, the authors introduce DexBench, a benchmark comprising 445 paired instances, and evaluate 13 prominent large language models. The results indicate that dual-path reasoning is a more robust and more discriminative measure of models’ dynamic code understanding.

📝 Abstract

Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, adopting them requires that they truly understand program execution rather than rely on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program's observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model's causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.
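To make the two tasks concrete, here is a minimal sketch of how one paired instance might be checked. Everything in it is hypothetical and chosen for illustration only: the toy program `classify`, the helper names `check_forward`, `check_backward`, and `score_pair`, and the rule that a pair counts as solved only when both directions are correct. None of it is taken from DexBench, whose actual instance format and scoring are not described in this summary.

```python
from typing import Callable


def classify(n: int) -> str:
    """Toy program under test (hypothetical); the returned label names the branch taken."""
    if n < 0:
        return "negative"
    if n % 2 == 0:
        return "even"
    return "odd"


def check_forward(program: Callable[[int], str], given_input: int, predicted: str) -> bool:
    """Forward task: the model predicts the behavior for a given input.

    The prediction is verified by actually executing the program.
    """
    return program(given_input) == predicted


def check_backward(program: Callable[[int], str], mutated_input: int, target: str) -> bool:
    """Backward task: the model proposes a mutated input meant to reach a target behavior.

    The proposal is verified by executing the program on the mutated input.
    """
    return program(mutated_input) == target


def score_pair(
    program: Callable[[int], str],
    given_input: int,
    predicted: str,
    mutated_input: int,
    target: str,
) -> bool:
    """Assumed scoring rule: a paired instance is solved only if both directions are right."""
    return check_forward(program, given_input, predicted) and check_backward(
        program, mutated_input, target
    )


if __name__ == "__main__":
    # Forward: for input 4 the model predicts "even".
    # Backward: to reach the "odd" branch the model mutates the input from 4 to 5.
    print(score_pair(classify, given_input=4, predicted="even", mutated_input=5, target="odd"))
```

Both checks are verified by running the program, so a backward answer does not have to match a single reference input: any mutated input that reaches the target behavior passes. That is one plausible reason such a task is harder to solve from memorized input-output pairs, though the summary does not spell out the mechanism.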

Problem

Research questions and friction points this paper is trying to address.

program execution

dynamic code reasoning

dual-path reasoning

benchmarking

large language models

Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-path reasoning

program execution understanding

dynamic code reasoning