🤖 AI Summary
To address the safety risks that arise from the non-determinism of large language models (LLMs) when co-generating safety-critical code for ADAS/autonomous driving (AD) systems, and to reduce the burden of manual review, this paper establishes a functional-safety-oriented evaluation framework for LLM-generated code. Methodologically, it introduces an automated robustness-checking and failure-mode classification pipeline; evaluates six mainstream code-generation LLMs (CodeLlama, CodeGemma, DeepSeek-r1, DeepSeek-Coder, Mistral, and GPT-4); designs a multi-dimensional static/dynamic verification mechanism; and constructs an LLM-specific failure-mode taxonomy with an ASIL-aware failure classification catalogue for automotive safety-critical programming tasks. Contributions include the systematic identification of prevalent failure patterns (e.g., in boundary handling, state-machine modelling, real-time constraints, and ASIL-compliant logic) and a qualitative fault analysis that supports human reviewers. The framework provides a reusable, trustworthy assessment paradigm for LLM-augmented development of autonomous driving software.
📝 Abstract
Software engineers in various industrial domains already use Large Language Models (LLMs) to accelerate the implementation of parts of software systems. When considering their potential use for ADAS or AD systems in the automotive context, this new setup needs systematic assessment: LLMs entail a well-documented set of risks for the development of safety-related systems due to their stochastic nature. To reduce the effort for code reviewers to evaluate LLM-generated code, we propose an evaluation pipeline that conducts sanity checks on the generated code. We compare the performance of six state-of-the-art LLMs (CodeLlama, CodeGemma, DeepSeek-r1, DeepSeek-Coder, Mistral, and GPT-4) on four safety-related programming tasks. Additionally, we qualitatively analyse the most frequent faults these LLMs produce and compile a failure-mode catalogue to support human reviewers. Finally, we discuss the limitations and capabilities of LLMs in code generation and how the proposed pipeline fits into existing development processes.
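In spirit, such a sanity-check pipeline combines a static check (does the generated code even parse?) with dynamic boundary-value checks, and maps failures to a coarse failure-mode label. The sketch below is a minimal illustration under stated assumptions: all function names, failure labels, and the `clamp_speed` example are hypothetical and not taken from the paper's actual implementation.

```python
# Illustrative sketch of a static + dynamic sanity-check pipeline for
# LLM-generated code. Names and labels are assumptions, not the paper's API.
import ast


def syntax_check(code: str) -> bool:
    """Static check: does the generated code parse at all?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def boundary_check(func, cases):
    """Dynamic check: run boundary-value cases, collect labelled failures."""
    failures = []
    for args, expected in cases:
        try:
            if func(*args) != expected:
                failures.append(("wrong_output", args))
        except Exception:
            failures.append(("runtime_error", args))
    return failures


def evaluate(code: str, entry_point: str, cases) -> dict:
    """Run static then dynamic checks; return a failure-mode summary."""
    if not syntax_check(code):
        return {"verdict": "fail", "mode": "syntax_error"}
    namespace = {}
    exec(code, namespace)  # sandboxing/isolation omitted in this sketch
    failures = boundary_check(namespace[entry_point], cases)
    if failures:
        return {"verdict": "fail", "mode": failures[0][0],
                "count": len(failures)}
    return {"verdict": "pass"}
```

A reviewer-facing run might then feed a generated speed-clamping function and a few boundary cases, e.g. `evaluate(code, "clamp_speed", [((120, 0, 100), 100), ((-5, 0, 100), 0)])`, and only escalate to full human review when the verdict is a failure, which is the effort reduction the abstract describes.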