Specification-Guided Repair of Arithmetic Errors in Dafny Programs using LLMs

📅 2025-07-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Problem: When formal verification of a Dafny program fails, debugging the underlying arithmetic error is difficult because automated, specification-aware repair techniques are lacking. Method: This paper presents an automated repair approach for Dafny that is driven by formal specifications. It uses Hoare logic to determine the program state at each statement and localize the fault, then uses Dafny’s formal specifications (pre-conditions, post-conditions, and invariants) as oracles to guide large language models (GPT-4o mini, Llama 3, Mistral 7B, and Llemma 7B) in synthesizing candidate fixes. Because the specifications serve as the correctness criterion, the approach does not depend on test suites for validation. Contribution/Results: Evaluated on the DafnyBench benchmark, the tool achieves 89.6% fault localization accuracy, and GPT-4o mini yields the highest repair success rate (74.18%).

📝 Abstract

Formal verification offers strong assurances of software correctness. However, debugging and repairing the underlying faults can be complex and time-consuming when verification fails. Automated Program Repair (APR) aims to ease this by automatically identifying and fixing faults. Traditional APR techniques often depend on test suites for validation, but these may fail to capture all scenarios. In contrast, formal specifications provide stronger correctness criteria for effective repairs. We present an innovative APR tool for Dafny, a verification-aware programming language. The tool uses formal specifications (pre-conditions, post-conditions, and invariants) as oracles for fault localization and repair. Assuming the correctness of the specifications and focusing on arithmetic bugs, we localize faults through a series of steps that include using Hoare Logic to determine the state of each statement within the program, and we use state-of-the-art Large Language Models (LLMs) to synthesize candidate fixes. The chosen models were GPT-4o mini, Llama 3, Mistral 7B, and Llemma 7B. We evaluate our approach using DafnyBench, a benchmark of real-world Dafny programs. Our tool achieves 89.6% accuracy in fault localization, with GPT-4o mini yielding the highest repair success rate (74.18%). These results highlight the potential of combining formal reasoning with LLM-driven program synthesis for automated program repair.
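To make the idea of a specification acting as a repair oracle concrete, here is a minimal Python sketch. It is not the paper's tool: the midpoint routine, its pre- and post-condition, and the candidate fixes are all hypothetical, and the bounded enumeration only stands in for the Dafny verifier, which would prove the postcondition for every input rather than check a finite range.

```python
from itertools import product
from typing import Callable, List, Optional, Tuple

# Hypothetical specification of a midpoint routine (not taken from the paper):
#   requires 0 <= a <= b
#   ensures  a <= result <= b
def precondition(a: int, b: int) -> bool:
    return 0 <= a <= b

def postcondition(a: int, b: int, result: int) -> bool:
    return a <= result <= b

def buggy_mid(a: int, b: int) -> int:
    return (a + b) * 2  # arithmetic bug: '*' where '//' was intended

# Stand-ins for replacement expressions that an LLM might propose.
CANDIDATES: List[Tuple[str, Callable[[int, int], int]]] = [
    ("a - b // 2", lambda a, b: a - b // 2),
    ("(a + b) // 2", lambda a, b: (a + b) // 2),
]

def satisfies_spec(f: Callable[[int, int], int], bound: int = 8) -> bool:
    """Bounded check: every input that meets the precondition must
    produce a result that meets the postcondition."""
    for a, b in product(range(-2, bound), repeat=2):
        if precondition(a, b) and not postcondition(a, b, f(a, b)):
            return False
    return True

def first_valid_repair(
    candidates: List[Tuple[str, Callable[[int, int], int]]]
) -> Optional[str]:
    """Return the text of the first candidate accepted by the oracle."""
    for text, f in candidates:
        if satisfies_spec(f):
            return text
    return None

if __name__ == "__main__":
    print(satisfies_spec(buggy_mid))
    print(first_valid_repair(CANDIDATES))
```

No test suite appears anywhere in the sketch: a candidate is accepted or rejected purely by the specification, which is the property the abstract contrasts with test-based APR.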

Problem

Research questions and friction points this paper is trying to address.

Automated repair of arithmetic errors in Dafny programs

Leveraging formal specifications for precise fault localization

Combining Hoare Logic and LLMs to synthesize fixes

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses formal specifications for fault localization

Leverages LLMs to synthesize candidate fixes

Combines Hoare Logic with state-of-the-art LLMs
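One way Hoare-style reasoning can point at a faulty statement is sketched below in Python. This summary does not describe the paper's actual localization procedure, so everything here is an assumption for illustration: the three-statement program, its intermediate assertions, and the bounded state enumeration that stands in for a real verifier. The sketch uses the standard assignment rule, where the weakest precondition of `x := e` with respect to `Q` is `Q` with `x` replaced by `e`, and flags each statement whose Hoare triple fails.

```python
from itertools import product
from typing import Callable, Dict, Iterable, List, Tuple

State = Dict[str, int]
Pred = Callable[[State], bool]
Expr = Callable[[State], int]
Assign = Tuple[str, Expr]  # (target variable, right-hand side)

def wp(stmt: Assign, post: Pred) -> Pred:
    """Assignment rule: wp(x := e, Q) holds in a state exactly when
    Q holds after x is updated to the value of e."""
    var, expr = stmt
    return lambda s: post({**s, var: expr(s)})

def failing_statements(
    stmts: List[Assign],
    assertions: List[Pred],
    values: Iterable[int] = range(-3, 4),
) -> List[int]:
    """Check each triple {A_i} s_i {A_(i+1)} over a bounded set of
    states and return the indices of the statements that violate it."""
    states = [dict(zip("xyz", vs)) for vs in product(list(values), repeat=3)]
    suspects = []
    for i, stmt in enumerate(stmts):
        pre, needed = assertions[i], wp(stmt, assertions[i + 1])
        if any(pre(s) and not needed(s) for s in states):
            suspects.append(i)
    return suspects

# Hypothetical program with an arithmetic bug in the second statement.
STMTS: List[Assign] = [
    ("y", lambda s: s["x"] + 1),
    ("z", lambda s: s["y"] - s["y"]),  # bug: '-' where '+' was intended
    ("z", lambda s: s["z"] + 1),
]

# Hypothetical assertions: precondition, two intermediate states, postcondition.
ASSERTIONS: List[Pred] = [
    lambda s: s["x"] >= 0,
    lambda s: s["y"] == s["x"] + 1,
    lambda s: s["z"] == 2 * s["x"] + 2,
    lambda s: s["z"] == 2 * s["x"] + 3,
]

if __name__ == "__main__":
    print(failing_statements(STMTS, ASSERTIONS))
```

The flagged index marks the expression a repair step would then rewrite. A program without intermediate assertions would need them derived first, for example by propagating the postcondition backwards with the same `wp` function.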

🔎 Similar Papers

An Empirical Evaluation of Pre-trained Large Language Models for Repairing Declarative Formal Specifications