🤖 AI Summary
This work addresses the prevalence of functional errors in code generated by large language models (LLMs) and the underexplored effectiveness of existing uncertainty quantification (UQ) methods for code generation. The authors systematically evaluate how UQ approaches developed for natural language generation transfer to code. They find that some token-probability-based methods generalize without modification, whereas sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code. To address this, they introduce functional equivalence methods, which replace NLI-based semantic equivalence with an LLM-based judgment of whether sampled programs are functionally equivalent, and define code-specific UQ metrics such as “functional entropy,” an analog of semantic entropy. These methods attain the highest AUROC in 11 out of 15 model–benchmark combinations and the best calibration across most settings.
📝 Abstract
Large language models have shown impressive capabilities in code generation, yet they often produce functionally incorrect code. Uncertainty quantification (UQ) methods have emerged as a promising approach for detecting hallucinations in natural language generation, but their effectiveness for code generation tasks remains underexplored. We systematically evaluate how UQ techniques transfer to code generation across three programming languages, five LLMs, and over 1,700 problems. We find that some token-probability-based methods generalize effectively without modification, while sampling-based methods relying on natural language inference (NLI) fail because NLI models cannot distinguish functionally different code, causing most responses to collapse into a single semantic cluster. To address this, we introduce functional equivalence methods, a family of code-specific methods that replace NLI-based semantic equivalence with an LLM-based functional equivalence assessment, including functional entropy, a code-specific analog of semantic entropy. Functional equivalence methods achieve top AUROC in 11 out of 15 model-benchmark combinations and the best calibration across most settings, consistently outperforming both NLI-based counterparts and all other methods evaluated.
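To make the idea of functional entropy concrete, the following is a minimal sketch, not the authors' implementation. It assumes the simplest count-based estimator: sampled programs are greedily grouped into clusters by an equivalence judge, and the entropy of the cluster frequencies is the uncertainty score. The names `cluster_by_equivalence`, `functional_entropy`, and `toy_judge` are hypothetical. In particular, `toy_judge` compares outputs on a few probe inputs and only stands in for the LLM-based functional equivalence assessment described in the abstract; the paper's variants may also weight clusters by token-level probabilities, which this sketch does not do.

```python
import math
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")


def cluster_by_equivalence(
    samples: Sequence[T], is_equivalent: Callable[[T, T], bool]
) -> List[List[int]]:
    """Greedily group sample indices; each sample is compared with the
    first member (representative) of every existing cluster."""
    clusters: List[List[int]] = []
    for i, sample in enumerate(samples):
        for cluster in clusters:
            if is_equivalent(samples[cluster[0]], sample):
                cluster.append(i)
                break
        else:
            # No existing cluster matched, so start a new one.
            clusters.append([i])
    return clusters


def functional_entropy(
    samples: Sequence[T], is_equivalent: Callable[[T, T], bool]
) -> float:
    """Entropy (in nats) of the empirical distribution over clusters.
    Assumption: each sample carries equal weight 1/n."""
    if not samples:
        raise ValueError("at least one sample is required")
    n = len(samples)
    clusters = cluster_by_equivalence(samples, is_equivalent)
    return -sum((len(c) / n) * math.log(len(c) / n) for c in clusters)


# Hypothetical stand-in for the LLM judge: two callables are treated as
# equivalent if they agree on a handful of probe inputs.
PROBES = [0, 1, 2, 3]


def toy_judge(f: Callable[[int], int], g: Callable[[int], int]) -> bool:
    return all(f(x) == g(x) for x in PROBES)


# Three "sampled solutions": two compute 2x, one computes x squared.
double_samples = [lambda x: x * 2, lambda x: x + x, lambda x: x ** 2]

print(cluster_by_equivalence(double_samples, toy_judge))
print(round(functional_entropy(double_samples, toy_judge), 4))
```

When all samples fall into one cluster the score is zero, which is the failure mode the abstract attributes to NLI-based clustering: if the equivalence check cannot separate functionally different programs, the entropy carries no signal about correctness.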