🤖 AI Summary
This work investigates whether large language models (LLMs) can directly and robustly translate abstract, standards-derived requirements into executable CARLA simulation configuration code in automotive industrial settings. Methodologically, it conducts the first systematic end-to-end “requirement → code” evaluation of LLMs—employing prompt engineering, domain-knowledge injection, and CARLA API-constrained decoding—on open-source models including Llama and Mistral. Results show that while LLMs achieve 68% functional correctness on structured requirements, failure rates rise sharply to 41% when confronted with real-world industrial challenges such as ambiguous phrasing or cross-document references. The study reveals a significant gap between current LLMs’ high-level semantic understanding capabilities and practical industrial deployment readiness, underscoring the indispensable role of human oversight. Furthermore, it establishes a novel benchmark and methodological framework for evaluating domain-specific LLMs in safety-critical, standards-driven domains.
📝 Abstract
Large Language Models (LLMs) are taking many industries by storm. They possess impressive reasoning capabilities and are capable of handling complex problems, as shown by their steadily improving scores on coding and mathematical benchmarks. However, are the models currently available truly capable of addressing real-world challenges, such as those found in the automotive industry? How well can they understand high-level, abstract instructions? Can they translate these instructions directly into functional code, or do they still need help and supervision? In this work, we put one of the current state-of-the-art models to the test. We evaluate its performance in the task of translating abstract requirements, extracted from automotive standards and documents, into configuration code for CARLA simulations.