Task Abstention for Large Language Models in Code Generation

📅 2026-05-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses functionally incorrect code caused by hallucination in large language model (LLM) code generation, which calls for a reliable way to decide when a model should abstain from producing output. The study proposes a distribution-free calibration method that uses multiple hypothesis testing to make abstention decisions and requires neither gold-standard test cases nor external knowledge. It executes multiple generated programs and measures the consistency of their outputs to assess the risk of each task. Because it compares execution results rather than source text, it handles programs that are semantically equivalent but syntactically different. Across multiple benchmarks and open-source code LLMs, it significantly outperforms existing methods, identifying high-risk generation tasks more accurately and efficiently and thereby making code generation systems safer and more robust.
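To make the idea of execution-based consistency concrete, here is a minimal sketch. It is not the paper's procedure, which this page does not spell out. It assumes the sampled programs are available as Python callables (stand-ins for generated code that would normally run in a sandbox) and that a handful of probe inputs exist. The names `output_signature`, `consistency_score`, `should_abstain`, and the fixed `threshold` are hypothetical.

```python
from collections import Counter

def output_signature(program, inputs):
    """Run one candidate on every probe input and record what it produced.

    Exceptions are folded into the signature so a crashing candidate
    still lands in some cluster instead of aborting the comparison.
    """
    signature = []
    for x in inputs:
        try:
            signature.append(repr(program(x)))
        except Exception as exc:
            signature.append("error:" + type(exc).__name__)
    return tuple(signature)

def consistency_score(programs, inputs):
    """Fraction of candidates that fall in the largest behavioral cluster."""
    counts = Counter(output_signature(p, inputs) for p in programs)
    return counts.most_common(1)[0][1] / len(programs)

def should_abstain(programs, inputs, threshold=0.6):
    """Abstain when the candidates disagree too much (threshold is illustrative)."""
    return consistency_score(programs, inputs) < threshold

# Three syntactically different but equivalent doublers, plus one outlier.
candidates = [lambda x: x * 2, lambda x: x + x, lambda x: 2 * x, lambda x: x ** 2]
print(consistency_score(candidates, [1, 2, 3]))  # 0.75
```

Grouping by observed outputs rather than by source text is what lets `x * 2`, `x + x`, and `2 * x` count as the same answer. In this sketch the threshold is a hand-picked constant, whereas the summary describes a calibrated rule.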

📝 Abstract

Large language models (LLMs) have revolutionized automated code generation. One serious concern, however, is the so-called "hallucination", i.e., LLMs may generate seemingly plausible but functionally incorrect code. In this paper, we study the task abstention problem, i.e., determining whether a given LLM should abstain from performing a specific code generation task to avoid likely hallucination. Our approach features a calibrated abstention rule, grounded in the principles of multiple hypothesis testing. The rule assesses generation consistency through code execution outcomes, allowing it to handle syntactic diversity of semantically equivalent code without reliance on oracle test cases or external databases. We prove that our approach provides a rigorous, distribution-free theoretical guarantee on its abstention decisions. We evaluate our method on benchmark datasets using several open-source code LLMs. Results show that our method allows generative models to more accurately and efficiently identify and abstain from tasks that induce hallucination compared to existing techniques, providing a reliable mechanism for safer and more robust code generation.
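The abstract does not give the form of the calibrated rule. As an illustration of how multiple hypothesis testing can yield a distribution-free guarantee, the sketch below swaps in a generic recipe: one Hoeffding p-value per candidate threshold, combined with a Bonferroni correction. It should not be read as the authors' algorithm. It assumes each task has a consistency score in [0, 1] and that a calibration set with known correctness is available, which the abstract does not confirm. The names `hoeffding_p_value`, `calibrate_threshold`, `alpha` (target rate of answered-but-wrong tasks), and `delta` (allowed failure probability) are hypothetical.

```python
import math

def hoeffding_p_value(empirical_risk, n, alpha):
    """P-value for the null hypothesis 'true risk > alpha', for losses in [0, 1]."""
    gap = max(alpha - empirical_risk, 0.0)
    return math.exp(-2.0 * n * gap * gap)

def calibrate_threshold(scores, wrong, grid, alpha=0.2, delta=0.1):
    """Pick the least conservative threshold whose risk is certified below alpha.

    scores: consistency score of each calibration task
    wrong:  True where the model's answer to that task was incorrect
    grid:   candidate thresholds; the model answers when score >= threshold
    Returns None when no threshold can be certified (abstain on everything).
    """
    n = len(scores)
    certified = []
    for lam in grid:
        # Loss counts tasks that were answered and turned out wrong.
        risk = sum(1 for s, w in zip(scores, wrong) if s >= lam and w) / n
        # Bonferroni: split the error budget delta across all tested thresholds.
        if hoeffding_p_value(risk, n, alpha) <= delta / len(grid):
            certified.append(lam)
    return min(certified) if certified else None

# Toy calibration set: high-consistency tasks are correct, low-consistency ones are not.
scores = [0.9] * 100 + [0.3] * 100
wrong = [False] * 100 + [True] * 100
print(calibrate_threshold(scores, wrong, grid=[0.2, 0.5, 0.8]))  # 0.5
```

Under these assumptions, every certified threshold has true risk at most `alpha` with probability at least `1 - delta`, and the bound makes no assumption about how scores are distributed. Returning the smallest certified threshold keeps abstention as rare as the guarantee allows.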

Problem

Research questions and friction points this paper is trying to address.

task abstention

code generation

hallucination

large language models

abstention decision

Innovation

Methods, ideas, or system contributions that make the work stand out.

task abstention

code generation

hallucination mitigation