Predicting Program Correctness By Ensemble Semantic Entropy

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the tendency of large language models to converge on consistent yet incorrect programs when no reliable correctness oracle is available. To mitigate this, the authors propose Ensemble Semantic Entropy (ESE), an efficient proxy metric for program correctness that measures semantic consistency across samples pooled from multiple models, quantifying the uncertainty of generated code more accurately than single-model consistency. They further introduce a cascaded test-time scaling framework to improve inference efficiency. On LiveCodeBench, ESE correlates significantly more strongly with actual program correctness than single-model approaches. Under stringent false-positive constraints, ESE improves prediction accuracy by 53.4%, while the cascaded framework reduces computational cost by 64.9% in FLOPs.
📝 Abstract
Large language models (LLMs) have demonstrated remarkable capabilities in generating programs from natural language descriptions, yet ensuring their correctness without an external oracle remains a critical challenge. To address this challenge, existing methods often rely on uncertainty estimation, measuring the consistency of semantics or execution behaviors across multiple samples generated by a single model. However, we observe that a single model can often converge to a consistent but incorrect solution, rendering such consistency-based proxies ineffective. To address this, we propose Ensemble Semantic Entropy (ESE), which estimates uncertainty by evaluating the consistency of samples aggregated across an ensemble of models. Experiments on LiveCodeBench demonstrate that ESE correlates more strongly with program correctness than single-model semantic entropy. Notably, in selective generation tasks with strict false-positive rate constraints, ESE improves prediction accuracy by 53.4%. Furthermore, by leveraging ESE as the decision signal, we propose a cascading test-time scaling framework, Cas, which maintains performance while reducing FLOPs by 64.9% compared to single-model scaling, offering a new perspective on balancing parameter and inference scaling.
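The core idea of semantic entropy is to group sampled programs into semantic-equivalence clusters and compute the entropy of the cluster distribution; ESE applies this to samples pooled across several models rather than one. The paper does not specify an implementation here, so the following is a minimal sketch under assumed details: semantics are proxied by a caller-supplied `semantic_key` (e.g., a program's outputs on a fixed set of probe inputs), and clusters are weighted by their empirical frequency.

```python
import math
from collections import Counter

def ensemble_semantic_entropy(samples, semantic_key):
    """Entropy over semantic clusters of candidate programs.

    samples: candidate programs pooled across an ensemble of models.
    semantic_key: maps a program to a hashable label of its semantics
    (here, hypothetically, its output on probe inputs).
    """
    clusters = Counter(semantic_key(s) for s in samples)
    n = len(samples)
    # Entropy of the empirical distribution over semantic clusters:
    # low entropy = strong cross-model agreement = predicted correct.
    return -sum((c / n) * math.log(c / n) for c in clusters.values())

# Toy illustration: semantics proxied by the program's output on input 3.
# Three of the four pooled candidates agree semantically (all compute 2x).
candidates = ["lambda x: x * 2", "lambda x: x + x",
              "lambda x: x ** 2", "lambda x: 2 * x"]
h = ensemble_semantic_entropy(candidates, lambda src: eval(src)(3))
```

In a selective-generation setting, one would accept a candidate only when `h` falls below a threshold tuned to the target false-positive rate; the `semantic_key` and eval-based toy clustering above are illustrative assumptions, not the paper's method.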
Problem

Research questions and friction points this paper is trying to address.

program correctness
uncertainty estimation
large language models
semantic consistency
code generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Ensemble Semantic Entropy
Program Correctness Prediction
Uncertainty Estimation
Test-Time Scaling
Large Language Models