ProofWright: Towards Agentic Formal Verification of CUDA

📅 2025-11-15

🤖 AI Summary

LLM-generated CUDA kernels often contain subtle correctness defects. Runtime testing provides insufficient coverage, and manual formal verification is intractable at scale, which creates a critical verification bottleneck. This paper introduces ProofWright, an agentic formal verification framework for LLM-generated CUDA code. It combines large language models, automated formal verification tools, and a modular agent architecture to verify memory safety, thread safety, and semantic equivalence end to end without manual effort. The framework reasons jointly about memory access patterns, concurrent execution behavior, and functional semantics, which addresses the scalability limits of manual verification. On the KernelBench L1 benchmark, it verifies safety properties for 74% of generated kernels, uncovers subtle correctness errors missed by testing, and establishes semantic equivalence for a class of element-wise kernels. Verification adds about three minutes per kernel.

📝 Abstract

Large Language Models (LLMs) are increasingly used to automatically generate optimized CUDA kernels, substantially improving developer productivity. However, despite rapid generation, these kernels often contain subtle correctness bugs and lack formal safety guarantees. Runtime testing is inherently unreliable (limited input coverage and reward hacking can mask incorrect behavior), while manual formal verification is reliable but cannot scale to match LLM output rates, creating a critical validation bottleneck. We present ProofWright, an agentic verification framework that bridges this gap by integrating automated formal verification with LLM-based code generation. ProofWright provides end-to-end guarantees of memory safety, thread safety, and semantic correctness for LLM-generated CUDA kernels. On KernelBench L1, ProofWright verifies safety properties for 74% of generated kernels, uncovers subtle correctness errors missed by conventional testing, and establishes semantic equivalence for a class of element-wise kernels. With a modest overhead of 3 minutes per kernel, ProofWright demonstrates that scalable, automated formal verification of LLM-generated GPU code is feasible, offering a path toward trustworthy high-performance code generation without sacrificing developer productivity.
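The following minimal Python sketch illustrates the three properties for a hypothetical element-wise kernel. It also shows why limited input coverage can hide a bug. This is not ProofWright's method. The names `guarded_index`, `unguarded_index`, and `check_kernel` are illustrative assumptions. The check enumerates every thread of one concrete launch, which makes it a bounded check. A formal verifier would instead have to prove the same properties for arbitrary sizes. Thread safety is simplified here to "no two threads write the same output element."

```python
# Illustrative bounded check of safety properties for a hypothetical
# element-wise CUDA kernel such as:
#   __global__ void relu(const float* x, float* y, int n) {
#       int i = blockIdx.x * blockDim.x + threadIdx.x;
#       if (i < n) y[i] = fmaxf(x[i], 0.0f);
#   }
# NOT ProofWright's method: it enumerates one concrete launch,
# whereas a formal verifier proves the properties for every n.
from typing import Callable, List, Optional, Set

# Maps (blockIdx.x, threadIdx.x, blockDim.x, n) to the index a thread
# accesses, or None if the thread exits before touching memory.
IndexFn = Callable[[int, int, int, int], Optional[int]]


def guarded_index(block: int, thread: int, block_dim: int, n: int) -> Optional[int]:
    """Kernel with the `if (i < n)` bounds guard."""
    i = block * block_dim + thread
    return i if i < n else None


def unguarded_index(block: int, thread: int, block_dim: int, n: int) -> Optional[int]:
    """Hypothetical buggy variant: the bounds guard was omitted."""
    return block * block_dim + thread


def check_kernel(index_fn: IndexFn, n: int, block_dim: int) -> List[str]:
    """Return property violations for one launch with ceil(n / block_dim) blocks."""
    grid_dim = (n + block_dim - 1) // block_dim
    violations: List[str] = []
    written: Set[int] = set()
    for block in range(grid_dim):
        for thread in range(block_dim):
            i = index_fn(block, thread, block_dim, n)
            if i is None:  # thread exited via the bounds guard
                continue
            # Memory safety: every access must stay inside [0, n).
            if not 0 <= i < n:
                violations.append(f"out-of-bounds access at index {i} (n={n})")
                continue
            # Thread safety (simplified): no two threads write the same y[i].
            if i in written:
                violations.append(f"write-write race on index {i}")
            written.add(i)
    # Element-wise equivalence also requires every output element to be written.
    missing = n - len(written)
    if missing:
        violations.append(f"{missing} output elements never written")
    return violations


# A test at a size divisible by the block size passes the buggy kernel...
assert check_kernel(unguarded_index, n=1024, block_dim=256) == []
# ...but other sizes expose writes past the end of the buffer.
print(check_kernel(unguarded_index, n=1000, block_dim=256)[:2])
# ['out-of-bounds access at index 1000 (n=1000)', 'out-of-bounds access at index 1001 (n=1000)']
```

The buggy kernel passes at n = 1024 and fails at n = 1000. This mirrors the abstract's point that testing at a few sizes can mask incorrect behavior. Any finite enumeration like this one stays incomplete. A formal approach instead reasons symbolically about the guard for every n and launch configuration.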

Problem

Research questions and friction points this paper addresses.

Ensuring formal safety guarantees for LLM-generated CUDA kernels

Detecting subtle correctness bugs missed by conventional testing

Scaling automated verification to match LLM code generation rates

Innovation

Methods, ideas, or system contributions that make the work stand out.

Agentic framework integrates formal verification with LLMs

Provides end-to-end safety guarantees for CUDA kernels

Automates semantic equivalence proofs for a class of element-wise GPU kernels