Hallucination in LLM-Based Code Generation: An Automotive Case Study

📅 2025-08-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study addresses hallucination—syntactically valid but semantically incorrect, API-inconsistent, or unverifiable outputs—in large language models (LLMs) generating safety-critical automotive software code. Method: We design a multi-level prompting strategy (single-line instruction, Covesa Vehicle Signal Specification integration, and code skeleton guidance) and systematically vary contextual richness as a controlled variable, conducting cross-model evaluations (e.g., GPT-4.1, GPT-4o). Contribution/Results: We empirically identify contextual richness as a critical moderator of hallucination: only under maximal context do certain models produce functionally correct code; all other conditions exhibit pervasive syntactic errors and logical inconsistencies. Our findings provide the first empirical evidence and a methodological framework for assessing and improving the reliability of LLM-generated code in safety-critical domains, directly informing prompt engineering and trustworthiness evaluation for autonomous driving software development.

📝 Abstract
Large Language Models (LLMs) have shown significant potential in automating code generation tasks, offering new opportunities across software engineering domains. However, their practical application remains limited by hallucinations: outputs that appear plausible but are factually incorrect, unverifiable, or nonsensical. This paper investigates hallucination phenomena in the context of code generation, with a specific focus on the automotive domain. A case study is presented that evaluates multiple code LLMs across three prompting complexities, ranging from a minimal one-liner prompt, to a prompt with Covesa Vehicle Signal Specification (VSS) entries as additional context, and finally to a prompt with an additional code skeleton. The evaluation reveals a high frequency of syntax violations, invalid reference errors, and API knowledge conflicts in the state-of-the-art models GPT-4.1, Codex, and GPT-4o. Among the evaluated models, only GPT-4.1 and GPT-4o produced a correct solution, and only when given the most context-rich prompt. Simpler prompting strategies failed to yield a working result, even after multiple refinement iterations. These findings highlight the need for effective mitigation techniques to ensure the safe and reliable use of LLM-generated code, especially in safety-critical domains such as automotive software systems.
Problem

Research questions and friction points this paper is trying to address.

Investigates hallucination in LLM-based automotive code generation
Evaluates syntax errors and API conflicts in top code LLMs
Highlights the need for reliable LLM-generated code in safety-critical systems
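
Two of the hallucination symptoms the paper counts, syntax violations and invalid reference errors, can be checked mechanically. The sketch below is an illustration of that idea only, not the paper's actual evaluation harness: the `KNOWN_SIGNALS` allowlist and the `Vehicle.`-prefix heuristic are assumptions.

```python
import ast

# Hypothetical allowlist of valid VSS-style signal paths; the paper does
# not specify its checking procedure at this granularity.
KNOWN_SIGNALS = {"Vehicle.Speed", "Vehicle.Powertrain.Range"}

def check_generated_code(source: str) -> list[str]:
    """Flag two hallucination symptoms named in the paper:
    syntax violations and invalid (unknown) references."""
    issues = []
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        # Syntactically broken output: report and stop.
        return [f"syntax violation: {exc.msg} (line {exc.lineno})"]
    # Collect string constants that look like VSS signal paths and
    # flag those absent from the known-signal set.
    for node in ast.walk(tree):
        if isinstance(node, ast.Constant) and isinstance(node.value, str):
            if node.value.startswith("Vehicle.") and node.value not in KNOWN_SIGNALS:
                issues.append(f"invalid reference: unknown signal {node.value!r}")
    return issues

# A misspelled signal name is caught as an invalid reference:
print(check_generated_code("speed = get('Vehicle.Sped')"))
```

A real pipeline would go further (type checking, running against a VSS catalogue, executing tests), but even this two-step filter separates the error classes the case study tallies.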
Innovation

Methods, ideas, or system contributions that make the work stand out.

Evaluates code LLMs with varying prompt complexities
Uses Covesa VSS as context for prompting
Tests models with additional code skeleton prompts
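
The three prompting tiers above can be sketched as nested templates of increasing contextual richness. This is an illustrative reconstruction: the task wording, the VSS excerpt, and the skeleton are assumptions, not the paper's actual prompts.

```python
# Illustrative task; the paper's concrete automotive task is not reproduced here.
TASK = "Implement a function that reads the vehicle speed and returns it in km/h."

# Assumed COVESA VSS excerpt for the relevant signal.
VSS_CONTEXT = """\
Vehicle.Speed:
  datatype: float
  unit: km/h
  type: sensor
"""

# Assumed code skeleton supplied in the most context-rich tier.
CODE_SKELETON = """\
def get_vehicle_speed(client):
    # TODO: query Vehicle.Speed via the client and return a float
    ...
"""

def build_prompt(level: int) -> str:
    """Level 1: one-liner; level 2: adds VSS signal spec; level 3: adds skeleton."""
    parts = [TASK]
    if level >= 2:
        parts.append("Relevant COVESA VSS signals:\n" + VSS_CONTEXT)
    if level >= 3:
        parts.append("Complete this skeleton:\n" + CODE_SKELETON)
    return "\n\n".join(parts)

for lvl in (1, 2, 3):
    print(f"--- prompt level {lvl} ---\n{build_prompt(lvl)}\n")
```

Treating the level as the single controlled variable, as the study does, lets the same task be posed to each model under all three conditions so that differences in output quality can be attributed to contextual richness.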