An Empirical Evaluation of Pre-trained Large Language Models for Repairing Declarative Formal Specifications

📅 2024-04-17
🏛️ arXiv.org
📈 Citations: 6
Influential: 1
📄 PDF
🤖 AI Summary
This work addresses the challenge of automated repair for declarative formal specifications—exemplified by Alloy—where existing automated program repair (APR) techniques predominantly target imperative languages and lack effective support for declarative specifications. We propose the first dual-agent LLM framework tailored for declarative specification repair: a Repair Agent that generates fixes and a Prompt Agent that dynamically refines prompts and integrates feedback from formal verification. Leveraging models such as GPT-4, combined with advanced prompt engineering and collaborative reasoning, our approach achieves substantial performance gains on standard Alloy benchmarks—significantly outperforming state-of-the-art specialized APR tools in repair success rate. Empirical evaluation demonstrates that large language models exhibit strong generalization capability and practical utility in formal specification repair, providing the first systematic validation of LLMs’ effectiveness and superiority in this domain.

📝 Abstract
Automatic Program Repair (APR) has garnered significant attention as a practical research domain focused on automatically fixing bugs in programs. While existing APR techniques primarily target imperative programming languages like C and Java, there is a growing need for effective solutions applicable to declarative software specification languages. This paper presents a systematic investigation into the capacity of Large Language Models (LLMs) for repairing declarative specifications in Alloy, a declarative formal language used for software specification. We propose a novel repair pipeline that integrates a dual-agent LLM framework, comprising a Repair Agent and a Prompt Agent. Through extensive empirical evaluation, we compare the effectiveness of LLM-based repair with state-of-the-art Alloy APR techniques on a comprehensive set of benchmarks. Our study reveals that LLMs, particularly GPT-4 variants, outperform existing techniques in terms of repair efficacy, albeit with a marginal increase in runtime and token usage. This research contributes to advancing the field of automatic repair for declarative specifications and highlights the promising potential of LLMs in this domain.
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs for repairing declarative formal specifications
Comparing single-agent and dual-agent repair paradigms in Alloy
Assessing auto-prompting and feedback mechanisms in LLM-based repair
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dual-agent LLM setup for declarative repair
Auto-prompting mechanism enhances LLM feedback
First empirical LLM evaluation for Alloy specifications
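The dual-agent pipeline can be pictured as a loop in which a Repair Agent proposes a candidate fix, the formal verifier checks it, and a Prompt Agent folds the verifier's feedback into the next prompt. A minimal sketch of that control flow is below; all function names are illustrative assumptions (the stubs stand in for real LLM calls and the Alloy Analyzer), not the paper's actual implementation.

```python
# Minimal sketch of a dual-agent repair loop: Repair Agent proposes,
# verifier checks, Prompt Agent refines the prompt from feedback.
# All names and the toy "repair" heuristic are illustrative assumptions.

def repair_agent(spec: str, prompt: str) -> str:
    """Stand-in for an LLM call that proposes a candidate fix."""
    # Toy heuristic: only "repairs" the spec once the prompt asks to refine.
    return spec.replace(" in ", " not in ") if "refine" in prompt else spec

def verify(spec: str) -> tuple[bool, str]:
    """Stand-in for the Alloy Analyzer: returns (passes, feedback)."""
    ok = "not in" in spec
    return ok, "" if ok else "counterexample found: a node reaches itself"

def prompt_agent(base_prompt: str, feedback: str) -> str:
    """Auto-prompting: fold verifier feedback into the next prompt."""
    return f"{base_prompt}\nVerifier feedback: {feedback}\nPlease refine."

def dual_agent_repair(spec: str, max_iters: int = 3) -> tuple[str, bool]:
    prompt = "Fix this Alloy specification."
    for _ in range(max_iters):
        candidate = repair_agent(spec, prompt)
        ok, feedback = verify(candidate)
        if ok:
            return candidate, True
        prompt = prompt_agent(prompt, feedback)  # feedback-driven refinement
        spec = candidate
    return spec, False

fixed, success = dual_agent_repair("all n: Node | n in n.next")
```

In a real system the two stubs would be separate LLM calls (e.g. GPT-4) and `verify` would invoke the Alloy Analyzer, but the loop structure — propose, verify, refine the prompt — is the same.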
Mohannad J. Alhanahnah
University of Wisconsin-Madison, Madison, WI, USA
Md Rashedul Hasan
University of Nebraska-Lincoln, Lincoln, NE, USA
Hamid Bagheri
University of Nebraska-Lincoln
Computer Science, Software Engineering