🤖 AI Summary
Large language models (LLMs) exhibit significant limitations in deductive code reasoning—i.e., precisely tracking program execution and state evolution—due to generation bias, misalignment between reasoning and execution capabilities, and poor zero-shot generalization. To address these issues, we propose ReMind, a multi-agent framework comprising three specialized agents: (1) the Mutator, which generates semantically equivalent code variants to mitigate source-code bias; (2) the Executor, which performs step-by-step execution and monitors variable states to expose reasoning inconsistencies; and (3) the Inspector, which identifies erroneous reasoning steps and refines control-flow logic. These agents jointly enable dynamic correction and controllable optimization of the reasoning process. Extensive experiments across two code-reasoning benchmarks and five mainstream LLMs demonstrate that ReMind substantially improves deductive reasoning accuracy and exhibits strong zero-shot generalization. To our knowledge, this is the first work to systematically integrate execution-feedback loops into LLM-based code reasoning enhancement.
📝 Abstract
Large Language Models (LLMs) have achieved remarkable progress in code-related tasks. Despite this advancement, empirical evidence reveals that they still struggle with *deductive code reasoning*, the ability to reason about the program execution process. While prior studies have recognized this limitation, its underlying causes remain largely underexplored. In this paper, we begin with a comprehensive empirical study that reveals three key challenges undermining deductive code reasoning: (1) an intrinsic gap between generation and reasoning abilities, (2) a consistent bias towards code sources, and (3) weak zero-shot generalization on complex benchmarks. In light of these challenges, we propose `ReMind`, a multi-agent framework composed of a `Mutator`, an `Executor`, and an `Inspector`. The `Mutator` generates code variants to mitigate bias towards code sources, the `Executor` traces variable states step by step to expose inconsistencies, and the `Inspector` identifies problematic reasoning steps and refines control-flow logic to bridge the intrinsic reasoning gap. Through their coordinated collaboration, `ReMind` systematically identifies and corrects reasoning flaws, achieving strong performance and robust zero-shot generalization. Extensive experiments on two benchmarks with five LLMs demonstrate the advantages of `ReMind` over baseline approaches in deductive code reasoning.
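The three-agent loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under loose assumptions: all function names are hypothetical, and direct `exec`-based tracing stands in for the LLM agents that `ReMind` actually uses for mutation, step-wise state tracking, and inspection.

```python
# Hypothetical sketch of a ReMind-style Mutator/Executor/Inspector loop.
# The real framework coordinates LLM agents; here each role is simulated.

def mutator(code: str) -> list[str]:
    """Produce semantically equivalent variants (here: a trivial rename)
    to counter bias towards the original code's surface form."""
    return [code, code.replace("tmp", "t2")]

def executor(code: str, inputs: dict) -> dict:
    """Trace execution and return the final variable states.
    (Stands in for an LLM's step-by-step state tracking.)"""
    env = dict(inputs)
    exec(code, {}, env)  # writes land in env, exposing intermediate state
    return env

def inspector(traces: list[dict]) -> bool:
    """Judge the reasoning consistent only if all variants agree
    on the value of the output variable."""
    return len({t["out"] for t in traces}) == 1

def remind(code: str, inputs: dict) -> tuple[int, bool]:
    """Run all variants, then cross-check their traces for agreement."""
    traces = [executor(v, inputs) for v in mutator(code)]
    return traces[0]["out"], inspector(traces)

code = "tmp = x + 1\nout = tmp * 2"
result, consistent = remind(code, {"x": 3})  # result = 8, consistent = True
```

In the paper's setting, a disagreement between variant traces would trigger the `Inspector`'s refinement step rather than simply being reported, but the cross-variant consistency check above captures the core feedback signal.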