🤖 AI Summary
To address data scarcity, poor reproducibility, and inconsistent evaluation in natural-language-to-Scenic code translation for autonomous driving scenario generation, this paper introduces NL2Scenic, an open-source dataset and unified evaluation framework. Methodologically, we propose EDIT-COMP, a novel metric integrating edit similarity and compilation success rate, and combine it with textual metrics (BLEU, chrF, EDIT-SIM) and execution-based validation. We establish a comprehensive multi-model benchmark covering zero-shot, few-shot, chain-of-thought, and retrieval-augmented prompting. Key contributions include: (1) the first systematic evaluation of mainstream large language models on Scenic code generation; (2) empirical evidence that medium-scale open-weight models such as Qwen2.5-Coder-14B achieve 88% of GPT-4o's expert-rated performance; and (3) a demonstration that retrieval augmentation substantially improves smaller models' accuracy, offering a cost-effective, reliable path to autonomous driving scenario generation.
📝 Abstract
Scenario simulation is central to testing autonomous driving systems. Scenic, a domain-specific language (DSL) used with the CARLA simulator, enables precise and reproducible scenarios, but natural-language (NL)-to-Scenic generation with large language models (LLMs) suffers from scarce data, limited reproducibility, and inconsistent metrics. We introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a difficulty-stratified 30-case test split, an Example Retriever, and 14 prompting variants (ZS, FS, CoT, SP, MoT). We evaluate 13 models: four proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine open-source code models (Qwen2.5-Coder 0.5B–32B; CodeLlama 7B/13B/34B), using text metrics (BLEU, chrF, EDIT-SIM, CrystalBLEU) and execution metrics (compilation and generation), and compare them against an expert study (n=11). EDIT-SIM correlates best with human judgments; we also propose EDIT-COMP (the F1 of EDIT-SIM and compilation rate) as a robust dataset-level proxy that improves ranking fidelity. GPT-4o performs best overall, while Qwen2.5-Coder-14B reaches about 88% of its expert score on local hardware. Retrieval-augmented prompting (Few-Shot with Example Retriever, FSER) consistently boosts smaller models, and scaling shows diminishing returns beyond mid-size, with Qwen2.5-Coder outperforming CodeLlama at comparable scales. NL2Scenic and EDIT-COMP offer a standardized, reproducible basis for evaluating Scenic code generation and indicate that mid-size open-source models are practical, cost-effective options for autonomous-driving scenario programming.
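The abstract describes EDIT-COMP as the F1 of EDIT-SIM and compilation. A minimal sketch of that idea, assuming EDIT-SIM is a normalized Levenshtein similarity averaged over the dataset and that the two components are combined via harmonic mean (the exact definitions and any weighting are assumptions here, not the paper's reference implementation):

```python
def edit_sim(ref: str, hyp: str) -> float:
    """Normalized edit similarity: 1 - Levenshtein(ref, hyp) / max(len).
    Assumed form of EDIT-SIM; the paper's exact normalization may differ."""
    m, n = len(ref), len(hyp)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))  # dynamic-programming row for edit distance
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return 1.0 - prev[n] / max(m, n)

def edit_comp(pairs: list[tuple[str, str]], compiled: list[bool]) -> float:
    """EDIT-COMP sketch: F1 (harmonic mean) of mean EDIT-SIM over
    (reference, hypothesis) pairs and the compilation success rate."""
    s = sum(edit_sim(r, h) for r, h in pairs) / len(pairs)
    c = sum(compiled) / len(compiled)
    if s + c == 0:
        return 0.0
    return 2 * s * c / (s + c)
```

Because the harmonic mean is dominated by the smaller component, a model that produces textually similar but non-compiling Scenic code (or trivially compiling but dissimilar code) scores low, which is the motivation for combining the two signals.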