The Path Not Taken: Duality in Reasoning about Program Execution

Date: 2026-04-21
Citations: 0
Influential: 0

AI Summary
Existing program understanding benchmarks are largely confined to unidirectional input-to-behavior prediction, limiting their ability to comprehensively evaluate large language models' dynamic and causal comprehension of program execution and rendering them susceptible to data contamination. To address this, this work proposes a dual-reasoning evaluation paradigm that jointly assesses a model's capacity for causal modeling of execution flow through forward reasoning (predicting behavior from given inputs) and backward reasoning (inferring input modifications required to achieve a target behavior). Based on this framework, we introduce DexBench, a benchmark comprising 445 paired instances, and evaluate 13 prominent large language models. Our results demonstrate that this dual-path reasoning approach provides a more robust and effective means of differentiating models' dynamic code understanding capabilities.

Abstract
Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their adoption requires a true understanding of program execution rather than relying on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program's observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model's causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.
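To make the two tasks concrete, the sketch below pairs a forward check with a backward check on a toy program. It is only an illustration of the duality described above, not DexBench's actual harness, data format, or scoring. The names `program`, `check_forward`, and `check_backward` are hypothetical, and the "behavior" is assumed to be a label for the branch taken.

```python
def program(s: str) -> str:
    """Toy program under test (hypothetical): returns a label for the branch taken."""
    if len(s) > 3:
        if s.startswith("a"):
            return "long-a"
        return "long-other"
    return "short"


def check_forward(inp: str, predicted: str) -> bool:
    """Forward task: does the predicted behavior match what execution produces?"""
    return program(inp) == predicted


def check_backward(original: str, mutated: str, target: str) -> bool:
    """Backward task: is the input actually changed, and does it reach the target behavior?"""
    return mutated != original and program(mutated) == target


if __name__ == "__main__":
    seed = "bcde"
    # Forward: a model is given `seed` and must predict the observed behavior.
    print(check_forward(seed, "long-other"))
    # Backward: a model is given `seed` and the target "long-a" and must propose a mutation.
    print(check_backward(seed, "acde", "long-a"))
```

Both checks run the same program, so a model that only pattern-matches on surface features can pass one direction and still fail the other. Pairing the two on the same instance is what the abstract refers to as dual-path reasoning.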
Problem

Research questions and friction points this paper is trying to address.

program execution
dynamic code reasoning
dual-path reasoning
benchmarking
large language models
Innovation

Methods, ideas, or system contributions that make the work stand out.

dual-path reasoning
program execution understanding
dynamic code reasoning
input mutation
DexBench