Reinforcement Learning with Negative Tests as Completeness Signal for Formal Specification Synthesis

📅 2026-04-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to program specification synthesis lack fine-grained feedback on specification completeness: because they rely on binary verification outcomes, they cannot distinguish weak specifications from strong ones. This work proposes SpecRL, a framework that introduces negative examples (input-output pairs the program can never produce) as a quantifiable signal of completeness. Working in Dafny, SpecRL establishes an end-to-end reinforcement learning pipeline that automatically generates negative examples and uses the proportion rejected by a candidate specification as a completeness-based reward to guide synthesis. Experiments across four model scales show that SpecRL consistently outperforms both supervised fine-tuning and reinforcement learning with binary strength rewards, significantly improving specification strength and verification success rates. Moreover, SpecRL surpasses larger general-purpose language models on out-of-distribution benchmarks.
📝 Abstract
The specification synthesis task aims to automatically generate specifications, together with any necessary auxiliary verification annotations, for existing programs. This task is important because such specifications serve as behavioral contracts that support modular reasoning and reusable verification across a codebase. At the same time, it remains challenging because verifier-only feedback is fundamentally incomplete: passing verification establishes soundness, but cannot distinguish weak specifications from strong ones. What is missing is a fine-grained signal for specification completeness. We present SpecRL, a reinforcement learning framework for specification synthesis in Dafny. SpecRL introduces a self-contained pipeline that generates negative tests, i.e., input-output pairs that can never be produced by the program. We use the fraction of these negative tests rejected by a candidate specification as a signal of specification completeness, which is integrated into the reward for RL training. Experiments across four model sizes show that SpecRL improves both specification strength and verification success over SFT and RL with a binary specification-strength reward, generalizes to an out-of-distribution benchmark, and remains competitive on that unseen benchmark compared to much larger general-purpose LLMs.
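The completeness signal described in the abstract can be sketched in a few lines: given a set of negative tests (input-output pairs the program can never produce), the reward is the fraction of those pairs that a candidate specification rejects. The sketch below is illustrative only; the names (`completeness_reward`, `spec_accepts`) are hypothetical, and the paper's actual pipeline operates on Dafny specifications rather than Python predicates.

```python
# Hedged sketch of a completeness-style reward, assuming a candidate
# specification can be queried as a predicate over (input, output) pairs.
# All identifiers here are illustrative, not the paper's actual API.
from typing import Callable, List, Tuple

def completeness_reward(
    spec_accepts: Callable[[int, int], bool],  # does the spec admit (input, output)?
    negative_tests: List[Tuple[int, int]],     # pairs the program can never produce
) -> float:
    """Fraction of negative tests rejected by the candidate specification."""
    if not negative_tests:
        return 0.0
    rejected = sum(1 for x, y in negative_tests if not spec_accepts(x, y))
    return rejected / len(negative_tests)

# Toy example: the program computes y = x * x. A weak postcondition (y >= 0)
# is sound but incomplete; an exact postcondition (y == x * x) is complete.
negatives = [(2, 5), (3, 10), (-1, 2)]   # outputs the squaring program never yields
weak_spec = lambda x, y: y >= 0          # admits every negative test -> reward 0.0
strong_spec = lambda x, y: y == x * x    # rejects every negative test -> reward 1.0

print(completeness_reward(weak_spec, negatives))    # 0.0
print(completeness_reward(strong_spec, negatives))  # 1.0
```

In an RL loop, this scalar would be combined with the verifier's pass/fail outcome, so that among verifying (sound) specifications, those rejecting more impossible behaviors earn higher reward.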
Problem

Research questions and friction points this paper is trying to address.

specification synthesis
completeness
formal verification
reinforcement learning
negative tests
Innovation

Methods, ideas, or system contributions that make the work stand out.

Reinforcement Learning
Specification Synthesis
Negative Tests
Completeness Signal
Formal Verification