🤖 AI Summary
This study addresses hallucination (outputs that look plausible but are incorrect, API-inconsistent, or unverifiable) in large language models (LLMs) that generate code for safety-critical automotive software. Method: We design a three-level prompting strategy (a one-line instruction, the same instruction with COVESA Vehicle Signal Specification (VSS) context, and the same with an additional code skeleton), treat contextual richness as the controlled variable, and evaluate several models (GPT-4.1, Codex, GPT-4o). Contribution/Results: Contextual richness emerges as a critical moderator of hallucination: only GPT-4.1 and GPT-4o produce functionally correct code, and only with the most context-rich prompt; all other conditions show frequent syntax violations, invalid references, and API knowledge conflicts, even after multiple refinement iterations. The findings provide empirical evidence and a methodological framework for assessing and improving the reliability of LLM-generated code in safety-critical domains, informing prompt engineering and trustworthiness evaluation for automotive software development.
📝 Abstract
Large Language Models (LLMs) have shown significant potential in automating code generation tasks, offering new opportunities across software engineering domains. However, their practical application remains limited by hallucinations: outputs that appear plausible but are factually incorrect, unverifiable, or nonsensical. This paper investigates hallucination phenomena in code generation, with a specific focus on the automotive domain. A case study evaluates multiple code LLMs at three levels of prompt complexity: a minimal one-line prompt, a prompt with the COVESA Vehicle Signal Specification (VSS) as additional context, and a prompt with both the VSS context and a code skeleton. The evaluation reveals a high frequency of syntax violations, invalid reference errors, and API knowledge conflicts in the state-of-the-art models GPT-4.1, Codex, and GPT-4o. Among the evaluated models, only GPT-4.1 and GPT-4o produced a correct solution, and only when given the most context-rich prompt. Simpler prompting strategies failed to yield a working result, even after multiple refinement iterations. These findings highlight the need for effective mitigation techniques to ensure the safe and reliable use of LLM-generated code, especially in safety-critical domains such as automotive software systems.
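To make the three prompt levels concrete, the following is a minimal sketch of how such prompts could be assembled, together with a simple check for invalid signal references in generated code. It is not the authors' setup: the task text, the one-entry VSS excerpt, the code skeleton, and the helper names `build_prompt` and `unknown_signals` are all assumptions made for illustration.

```python
import re

# Hypothetical task; this wording does not come from the paper.
TASK = "Write a Python function that warns the driver when the vehicle speed exceeds 120 km/h."

# Illustrative VSS-style excerpt (signal path -> short description), assumed for this sketch.
VSS_EXCERPT = {
    "Vehicle.Speed": "Vehicle speed in km/h (sensor, float)",
}

# Illustrative skeleton that the model would be asked to complete.
CODE_SKELETON = '''def check_speed(signals: dict) -> bool:
    """Return True if a speed warning should be raised."""
    # TODO: read Vehicle.Speed from `signals` and compare with the limit
    ...
'''


def build_prompt(level: int) -> str:
    """Build a prompt at level 1 (one-liner), 2 (+ VSS context), or 3 (+ VSS and skeleton)."""
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2, or 3")
    parts = [TASK]
    if level >= 2:
        signals = "\n".join(f"- {path}: {desc}" for path, desc in VSS_EXCERPT.items())
        parts.append("Use only these VSS signals:\n" + signals)
    if level >= 3:
        parts.append("Complete this skeleton:\n" + CODE_SKELETON)
    return "\n\n".join(parts)


def unknown_signals(generated_code: str) -> set:
    """Return VSS-style paths referenced in the code that are absent from the excerpt."""
    referenced = set(re.findall(r"Vehicle(?:\.[A-Za-z0-9_]+)+", generated_code))
    return referenced - set(VSS_EXCERPT)
```

Each level extends the previous one, so contextual richness is the only thing that changes between prompts. The `unknown_signals` check catches just one of the error classes named above (references to signals that do not exist in the supplied specification); syntax violations and API knowledge conflicts would need separate checks, such as parsing and executing the generated code.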