AI Summary
Existing program understanding benchmarks are largely confined to unidirectional input-to-behavior prediction, limiting their ability to comprehensively evaluate large language models' dynamic and causal comprehension of program execution and rendering them susceptible to data contamination. To address this, this work proposes a dual-reasoning evaluation paradigm that jointly assesses a model's capacity for causal modeling of execution flow through forward reasoning (predicting behavior from given inputs) and backward reasoning (inferring input modifications required to achieve a target behavior). Based on this framework, we introduce DexBench, a benchmark comprising 445 paired instances, and evaluate 13 prominent large language models. Our results demonstrate that this dual-path reasoning approach provides a more robust and effective means of differentiating models' dynamic code understanding capabilities.
Abstract
Large language models (LLMs) have shown remarkable capabilities across diverse coding tasks. However, their adoption requires a true understanding of program execution rather than relying on surface-level patterns. Existing benchmarks primarily focus on predicting program properties tied to specific inputs (e.g., code coverage, program outputs). As a result, they provide a narrow view of dynamic code reasoning and are prone to data contamination. We argue that understanding program execution requires evaluating its inherent duality through two complementary reasoning tasks: (i) predicting a program's observed behavior for a given input, and (ii) inferring how the input must be mutated toward a specific behavioral objective. Both tasks jointly probe a model's causal understanding of execution flow. We instantiate this duality in DexBench, a benchmark comprising 445 paired instances, and evaluate 13 LLMs. Our results demonstrate that dual-path reasoning provides a robust and discriminative proxy for dynamic code understanding.
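To make the two tasks concrete, the following is a minimal Python sketch of what one paired instance might look like. It is an illustration only: the toy program `classify`, the helpers `check_forward` and `check_backward`, and the use of a return value as the "behavior" are all assumptions made for this example and do not reflect DexBench's actual instance format or scoring.

```python
def classify(n: int) -> str:
    """Toy program under test (hypothetical, not taken from DexBench)."""
    if n < 0:
        return "negative"
    if n % 2 == 0:
        return "even"
    return "odd"


def check_forward(program, given_input, predicted_behavior) -> bool:
    """Forward task: the model predicts the behavior observed for a given input.

    The prediction is validated by actually executing the program.
    """
    return program(given_input) == predicted_behavior


def check_backward(program, original_input, mutated_input, target_behavior) -> bool:
    """Backward task: the model proposes a mutation of the input that
    drives execution toward a target behavior.

    The answer counts only if the input really changed and the program,
    when executed on the mutated input, exhibits the target behavior.
    """
    return mutated_input != original_input and program(mutated_input) == target_behavior


# One paired instance built around the same program and the same seed input (7).
forward_ok = check_forward(classify, 7, "odd")             # model answered "odd"
backward_ok = check_backward(classify, 7, -3, "negative")  # model proposed -3

# Under this sketch, a pair is solved only when both directions are correct.
paired_ok = forward_ok and backward_ok
```

Requiring both directions to succeed on the same program is one plausible way to read "paired instances": a model that memorized the output for a fixed input could pass the forward check yet still fail to produce an input that reaches a different branch.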