Improving LLM Code Reasoning via Semantic Equivalence Self-Play with Formal Verification

📅 2026-04-18

🤖 AI Summary

This work addresses the limited accuracy and generalization of large language models on code semantic equivalence reasoning by proposing a self-play training framework. The approach combines formal proofs from Liquid Haskell (for equivalence) with execution-based counterexamples (for inequivalence) to construct supervision signals, and trains a generator and an evaluator adversarially under a difficulty-aware curriculum. Key contributions are a formal verification-guided supervision mechanism for semantic equivalence, the OpInstruct-HSx dataset, and empirical gains: up to a 13.3 percentage point accuracy improvement on EquiBench and consistent gains on PySecDB. Together, these results point to the importance of formal semantics for model reasoning.

📝 Abstract

We introduce a self-play framework for semantic equivalence in Haskell, utilizing formal verification to guide adversarial training between a generator and an evaluator. The framework leverages Liquid Haskell proofs for validating equivalence and execution-based counterexamples for inequivalence, organized via a difficulty-aware curriculum. To facilitate this, we release OpInstruct-HSx, a synthetic dataset of approximately 28k validated Haskell programs. Empirical experiments show that our evaluator transfers effectively to downstream tasks, achieving up to a 13.3pp accuracy gain on EquiBench and consistent gains on PySecDB. Ablation studies on the SEQ-SINQ regimes indicate that while inequivalence supervision provides data volume, equivalence proofs are uniquely responsible for the model's reasoning capabilities. The entire training pipeline and dataset are publicly released on GitHub and Hugging Face, respectively.
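To make the asymmetry between the two supervision signals concrete, here is a minimal sketch of an execution-based counterexample search. It is written in Python purely for illustration (the paper's programs are Haskell), and every name in it (`small_int_lists`, `find_counterexample`, and the two program pairs) is hypothetical rather than taken from the released pipeline. A disagreeing input is a definitive witness of inequivalence, whereas finding none proves nothing, which is why the equivalent side needs a formal proof.

```python
from itertools import product


def small_int_lists(values=(-2, -1, 0, 1, 2), max_len=3):
    """Enumerate every non-empty list over `values` with up to `max_len` elements."""
    return [list(combo)
            for n in range(1, max_len + 1)
            for combo in product(values, repeat=n)]


def find_counterexample(f, g, inputs):
    """Return the first input on which f and g disagree, or None if none is found."""
    for x in inputs:
        if f(x) != g(x):
            return x
    return None


# Hypothetical pair A: two ways of summing a list (semantically equivalent).
def sum_loop(xs):
    total = 0
    for x in xs:
        total += x
    return total


def sum_builtin(xs):
    return sum(xs)


# Hypothetical pair B: these differ whenever the largest magnitude is negative.
def abs_max(xs):
    return max(abs(x) for x in xs)


def plain_max(xs):
    return max(xs)


inputs = small_int_lists()

# No witness found: consistent with equivalence, but NOT a proof of it.
print(find_counterexample(sum_loop, sum_builtin, inputs))  # None

# A concrete witness: abs_max([-2]) is 2 while plain_max([-2]) is -2.
print(find_counterexample(abs_max, plain_max, inputs))  # [-2]
```

Bounded enumeration is only one way to look for witnesses; random or property-based input generation would serve the same role in this sketch.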

Problem

Research questions and friction points this paper is trying to address.

semantic equivalence

code reasoning

formal verification

large language models

Haskell

Innovation

Methods, ideas, or system contributions that make the work stand out.

semantic equivalence

formal verification

self-play