🤖 AI Summary
To address the safety risks that arise from the non-determinism of large language models (LLMs) when co-generating safety-critical code for ADAS/autonomous driving (AD) systems, and to reduce the burden of manual review, this paper establishes a functional-safety-oriented evaluation framework for LLM-generated code. Methodologically, it introduces an automated robustness-checking and failure-mode classification pipeline; evaluates six mainstream code-generation LLMs (CodeLlama, CodeGemma, DeepSeek-r1, DeepSeek-Coder, Mistral, and GPT-4); designs a multi-dimensional static/dynamic verification mechanism; and constructs an LLM-specific failure-mode taxonomy with an ASIL-aware failure classification catalogue for automotive safety-critical programming tasks. Contributions include the systematic identification of prevalent failure patterns (e.g., in boundary handling, state-machine modelling, real-time constraints, and ASIL-compliant logic) and a qualitative fault analysis that supports human reviewers. The framework provides a reusable, trustworthy assessment paradigm for LLM-augmented development of autonomous driving software.
📝 Abstract
Software engineers in various industrial domains already use Large Language Models (LLMs) to accelerate the implementation of parts of software systems. When considering their potential use for ADAS or AD systems in the automotive context, this new setup needs systematic assessment: LLMs entail a well-documented set of risks for the development of safety-related systems due to their stochastic nature. To reduce the effort for code reviewers to evaluate LLM-generated code, we propose an evaluation pipeline that conducts sanity checks on the generated code. We compare the performance of six state-of-the-art LLMs (CodeLlama, CodeGemma, DeepSeek-r1, DeepSeek-Coder, Mistral, and GPT-4) on four safety-related programming tasks. Additionally, we qualitatively analyse the most frequent faults these LLMs produce and compile a failure-mode catalogue to support human reviewers. Finally, we discuss the limitations and capabilities of LLMs in code generation and how the proposed pipeline fits into existing development processes.
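In spirit, such a sanity-check pipeline combines a static check (does the generated code even parse?) with dynamic boundary-value checks, and maps failures to a coarse failure-mode label. The sketch below is a minimal illustration under stated assumptions: all function names, failure labels, and the `clamp_speed` example are hypothetical and not taken from the paper's actual implementation.

```python
# Illustrative sketch of a static + dynamic sanity-check pipeline for
# LLM-generated code. Names and labels are assumptions, not the paper's API.
import ast


def syntax_check(code: str) -> bool:
    """Static check: does the generated code parse at all?"""
    try:
        ast.parse(code)
        return True
    except SyntaxError:
        return False


def boundary_check(func, cases):
    """Dynamic check: run boundary-value cases, collect labelled failures."""
    failures = []
    for args, expected in cases:
        try:
            if func(*args) != expected:
                failures.append(("wrong_output", args))
        except Exception:
            failures.append(("runtime_error", args))
    return failures


def evaluate(code: str, entry_point: str, cases) -> dict:
    """Run static then dynamic checks; return a failure-mode summary."""
    if not syntax_check(code):
        return {"verdict": "fail", "mode": "syntax_error"}
    namespace = {}
    exec(code, namespace)  # sandboxing/isolation omitted in this sketch
    failures = boundary_check(namespace[entry_point], cases)
    if failures:
        return {"verdict": "fail", "mode": failures[0][0],
                "count": len(failures)}
    return {"verdict": "pass"}
```

A reviewer-facing run might then feed a generated speed-clamping function and a few boundary cases, e.g. `evaluate(code, "clamp_speed", [((120, 0, 100), 100), ((-5, 0, 100), 0)])`, and only escalate to full human review when the verdict is a failure, which is the effort reduction the abstract describes.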