🤖 AI Summary
Current automated program repair (APR) methods over-rely on static analysis while neglecting dynamic runtime behavior, limiting their ability to guide large language models (LLMs) toward accurate fixes.
Method: This paper presents the first systematic investigation into how program execution traces enhance LLM-based repair. We propose a trace-injection prompting strategy that structurally incorporates dynamic execution information into LLM inputs while keeping computational complexity manageable.
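As an illustration of the prompting strategy described above, the sketch below assembles a repair prompt that injects a pre-collected execution trace alongside the buggy code and its failing test. The function name, section headers, and the truncation limit are illustrative assumptions, not the paper's actual implementation; truncation reflects the finding that very complex traces reduce effectiveness.

```python
def build_trace_prompt(buggy_code: str, failing_test: str,
                       trace_lines: list[str],
                       max_trace_lines: int = 50) -> str:
    """Assemble an APR prompt that injects an execution trace.

    The trace is truncated to keep prompt complexity manageable
    (hypothetical limit; the paper's exact format is not shown here).
    """
    trace_block = "\n".join(trace_lines[:max_trace_lines])
    return (
        "Fix the bug in the following function.\n\n"
        f"### Buggy code\n{buggy_code}\n\n"
        f"### Failing test\n{failing_test}\n\n"
        f"### Execution trace (truncated)\n{trace_block}\n\n"
        "### Fixed code\n"
    )

prompt = build_trace_prompt(
    "def add(a, b):\n    return a - b",
    "assert add(1, 2) == 3",
    ["line 2: a=1, b=2", "return -1"],
)
```

The resulting string would then be sent to the LLM as-is; a trace-free baseline simply omits the trace section.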
Contribution/Results: An extensive evaluation across six dataset–model combinations, including controlled ablation studies and probing analyses, maps the efficacy boundary of execution traces: naive trace injection yields significant accuracy gains in only two of the six configurations, whereas LLM-optimized trace prompts outperform trace-free baselines more consistently, and trace-based prompting also beats fine-tuning a smaller LLM on a small-scale dataset. The core contribution is establishing execution traces as a novel, effective signal for LLM-based program understanding, enabling a scalable, dynamically aware APR paradigm.
📝 Abstract
Large Language Models (LLMs) show promising performance on various programming tasks, including Automatic Program Repair (APR). However, most approaches to LLM-based APR are limited to static analysis of programs and disregard their runtime behavior. Inspired by knowledge-augmented NLP, in this work we aim to remedy this potential blind spot by augmenting standard APR prompts with program execution traces. We evaluate our approach using the GPT family of models on three popular APR datasets. Our findings suggest that simply incorporating execution traces into the prompt provides only a limited performance improvement over trace-free baselines, in just 2 out of 6 tested dataset/model configurations. We further find that the effectiveness of execution traces for APR diminishes as their complexity increases. We explore several strategies for leveraging traces in prompts and demonstrate that LLM-optimized prompts outperform trace-free prompts more consistently. Additionally, we show trace-based prompting to be superior to fine-tuning a smaller LLM on a small-scale dataset, and we conduct probing studies that reinforce the notion that execution traces can complement the reasoning abilities of LLMs.
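For concreteness, line-level execution traces of the kind the abstract describes can be collected with Python's standard `sys.settrace` hook. The sketch below is a minimal, assumed trace format (line number plus local variables); the paper's actual trace representation may differ.

```python
import sys

def collect_trace(func, *args):
    """Run func(*args) and record a line-level trace of its execution.

    Each entry records the line number and the local variables at that
    point (an illustrative format, not the paper's exact one).
    """
    trace = []
    code = func.__code__

    def tracer(frame, event, arg):
        # Only record 'line' events inside the target function's frame.
        if event == "line" and frame.f_code is code:
            trace.append(f"line {frame.f_lineno}: {dict(frame.f_locals)}")
        return tracer

    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(None)  # always remove the hook
    return result, trace

def buggy_add(a, b):
    return a - b  # bug: should be a + b

result, trace = collect_trace(buggy_add, 1, 2)
# result is the (wrong) return value; trace records each executed line
```

Traces collected this way grow with the number of executed lines, which connects to the observation that trace complexity limits their usefulness in prompts.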