🤖 AI Summary
To address data scarcity, poor reproducibility, and inconsistent evaluation in natural-language-to-Scenic code translation for autonomous driving scenario generation, this paper introduces NL2Scenic, an open-source dataset and unified evaluation framework. Methodologically, we propose EDIT-COMP, a novel metric integrating edit similarity and compilation success rate, and combine it with textual metrics (BLEU, chrF, EDIT-SIM) and execution-based validation. We establish a comprehensive multi-model benchmark covering zero-shot, few-shot, chain-of-thought, and retrieval-augmented prompting. Key contributions include: (1) the first systematic evaluation of mainstream large language models on Scenic code generation; (2) empirical evidence that medium-scale open-weight models such as Qwen2.5-Coder-14B achieve 88% of GPT-4o's expert-rated performance; and (3) a demonstration that retrieval augmentation substantially improves smaller models' accuracy, offering a cost-effective, reliable path to autonomous driving scenario generation.
📝 Abstract
Scenario simulation is central to testing autonomous driving systems. Scenic, a domain-specific language (DSL) used with the CARLA simulator, enables precise and reproducible scenarios, but natural-language (NL)-to-Scenic generation with large language models (LLMs) suffers from scarce data, limited reproducibility, and inconsistent metrics. We introduce NL2Scenic, an open dataset and framework with 146 NL/Scenic pairs, a difficulty-stratified 30-case test split, an Example Retriever, and 14 prompting variants (ZS, FS, CoT, SP, MoT). We evaluate 13 models: four proprietary (GPT-4o, GPT-5, Claude-Sonnet-4, Gemini-2.5-pro) and nine open-source code models (Qwen2.5-Coder 0.5B–32B; CodeLlama 7B/13B/34B), using text metrics (BLEU, chrF, EDIT-SIM, CrystalBLEU) and execution metrics (compilation and generation), and compare them against an expert study (n=11). EDIT-SIM correlates best with human judgments; we also propose EDIT-COMP (the F1 of EDIT-SIM and compilation rate) as a robust dataset-level proxy that improves ranking fidelity. GPT-4o performs best overall, while Qwen2.5-Coder-14B reaches about 88% of its expert score on local hardware. Retrieval-augmented prompting (Few-Shot with Example Retriever, FSER) consistently boosts smaller models, and scaling shows diminishing returns beyond mid-size, with Qwen2.5-Coder outperforming CodeLlama at comparable scales. NL2Scenic and EDIT-COMP offer a standardized, reproducible basis for evaluating Scenic code generation and indicate that mid-size open-source models are practical, cost-effective options for autonomous-driving scenario programming.
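The abstract describes EDIT-COMP as the F1 of EDIT-SIM and compilation. A minimal sketch of that idea, assuming EDIT-SIM is a normalized Levenshtein similarity averaged over the dataset and that the two components are combined via harmonic mean (the exact definitions and any weighting are assumptions here, not the paper's reference implementation):

```python
def edit_sim(ref: str, hyp: str) -> float:
    """Normalized edit similarity: 1 - Levenshtein(ref, hyp) / max(len).
    Assumed form of EDIT-SIM; the paper's exact normalization may differ."""
    m, n = len(ref), len(hyp)
    if max(m, n) == 0:
        return 1.0
    prev = list(range(n + 1))  # dynamic-programming row for edit distance
    for i in range(1, m + 1):
        cur = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            cur[j] = min(prev[j] + 1,        # deletion
                         cur[j - 1] + 1,     # insertion
                         prev[j - 1] + cost) # substitution
        prev = cur
    return 1.0 - prev[n] / max(m, n)

def edit_comp(pairs: list[tuple[str, str]], compiled: list[bool]) -> float:
    """EDIT-COMP sketch: F1 (harmonic mean) of mean EDIT-SIM over
    (reference, hypothesis) pairs and the compilation success rate."""
    s = sum(edit_sim(r, h) for r, h in pairs) / len(pairs)
    c = sum(compiled) / len(compiled)
    if s + c == 0:
        return 0.0
    return 2 * s * c / (s + c)
```

Because the harmonic mean is dominated by the smaller component, a model that produces textually similar but non-compiling Scenic code (or trivially compiling but dissimilar code) scores low, which is the motivation for combining the two signals.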