Inferring Code Correctness from Specification

📅 2026-05-28

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Verifying the correctness of code generated by large language models remains challenging. This work proposes TRAILS, a novel approach that, for the first time, integrates specification-guided generation of diverse test inputs with model-based consistency checking against input-output pairs, thereby circumventing the biases of static reasoning and the overhead of dynamic consensus. TRAILS partitions inputs into categories to generate test cases and aggregates scores across multiple inputs to improve evaluation stability and coverage. Experiments on LiveCodeBench and CoCoClaNeL show that TRAILS improves the Matthews correlation coefficient by up to 39% relative to Zero-Shot Chain-of-Thought and consistently outperforms HoarePrompt, while yielding more stable results across runs and correctly labeling a larger set of unique code samples.
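For readers unfamiliar with the metric, the Matthews correlation coefficient (MCC) condenses a binary confusion matrix into a single value between -1 and 1, and a relative improvement compares two such values as a ratio. The sketch below shows both computations. The function names are illustrative, and the numbers in the usage lines are made up for the example; they are not results from the paper.

```python
import math


def matthews_corrcoef(tp: int, tn: int, fp: int, fn: int) -> float:
    """MCC from confusion-matrix counts; returns 0.0 when it is undefined."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        # A common convention when any row or column of the matrix is empty.
        return 0.0
    return (tp * tn - fp * fn) / denom


def relative_improvement(baseline: float, new: float) -> float:
    """Improvement of `new` over `baseline`, as a fraction of `baseline`."""
    if baseline == 0:
        raise ValueError("relative improvement is undefined for a zero baseline")
    return (new - baseline) / baseline


# Hypothetical counts: every verdict agrees with the ground truth.
perfect = matthews_corrcoef(tp=50, tn=50, fp=0, fn=0)  # 1.0

# Hypothetical MCC scores for a baseline and an improved verifier.
gain = relative_improvement(baseline=0.40, new=0.556)
print(f"perfect MCC = {perfect}, relative gain = {gain:.0%}")
```

Unlike plain accuracy, MCC stays near 0 for a verifier that labels every program "correct", which matters when correct and incorrect samples are imbalanced.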

📝 Abstract

Large language models (LLMs) have become integral to modern software development, enabling automated code generation at scale. However, validating the correctness of LLM-generated code remains a critical and largely unsolved challenge. Existing approaches rely either on dynamic consensus across multiple code candidates, which makes them costly and difficult to scale, or on static reasoning that is susceptible to dynamic bugs and order bias. In this paper, we propose TRAILS (Targeted Reasoning Agreement via Inputs and Specifications), an approach that grounds LLM reasoning in concrete (input, output) pairs. TRAILS first generates diverse test inputs via category partitioning based on the specification, then executes them against the candidate code and prompts LLMs to assess whether the resulting input-output pairs conform to the specification, without ever reasoning over the code itself. Scores are aggregated across inputs to determine whether the program is likely correct. We evaluate TRAILS on two datasets, LiveCodeBench and CoCoClaNeL, across three LLMs (Qwen3Coder-30B, Devstral-Small-24B, and Olmo3.1-Instruct), comparing against HoarePrompt and a Zero-Shot Chain-of-Thought (CoT) baseline. TRAILS improves the Matthews correlation coefficient by up to 39% relative to Zero-Shot CoT and consistently outperforms HoarePrompt. Beyond accuracy, TRAILS demonstrates greater stability across seeded runs, reducing sensitivity to LLM non-determinism, and assigns correct labels to a larger set of unique code samples than competing approaches.
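The pipeline described in the abstract (categorized inputs, execution, per-pair judgment, aggregation) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the `judge` callable stands in for the LLM prompt, the mean-plus-threshold aggregation is an assumed rule (the abstract does not say how scores are combined), and every name here is hypothetical.

```python
from statistics import mean
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical stand-in for the LLM call: given one (input, output) pair,
# return a conformance score in [0, 1] with respect to the specification.
Judge = Callable[[Any, Any], float]


def assess_program(
    candidate: Callable[[Any], Any],
    inputs_by_category: Dict[str, List[Any]],
    judge: Judge,
    threshold: float,
) -> Tuple[float, bool]:
    """Run the candidate on categorized inputs and aggregate judge scores."""
    scores: List[float] = []
    for inputs in inputs_by_category.values():
        for x in inputs:
            try:
                y = candidate(x)
            except Exception as exc:
                # A crash is still an observable behavior the judge can assess.
                y = f"<raised {type(exc).__name__}>"
            # The judge sees only the pair, never the candidate's source code.
            scores.append(judge(x, y))
    if not scores:
        raise ValueError("at least one test input is required")
    score = mean(scores)  # assumed aggregation rule
    return score, score >= threshold


# Toy specification: "return the absolute value of an integer".
inputs = {"negative": [-3, -1], "zero": [0], "positive": [2, 5]}


def toy_judge(x: int, y: Any) -> float:
    # Deterministic stand-in for the LLM's conformance verdict.
    return 1.0 if y == abs(x) else 0.0


def buggy_abs(x: int) -> int:
    return x  # wrong for negative inputs


print(assess_program(abs, inputs, toy_judge, threshold=0.9))
print(assess_program(buggy_abs, inputs, toy_judge, threshold=0.9))
```

Partitioning the inputs matters in this toy case: a bug confined to one category (negative numbers) surfaces only because that category is guaranteed to be sampled.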

Problem

Research questions and friction points this paper is trying to address.

code correctness

large language models

specification validation

automated code generation

program verification

Innovation

Methods, ideas, or system contributions that make the work stand out.

code correctness

specification-based validation

input-output conformance