Semantic VLM Dataset for Safe Autonomous Driving

📅 2025-11-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
A lack of scene-level understanding datasets that support both vision-language model (VLM) training and interpretable evaluation hinders progress in autonomous driving. Method: We introduce CAR-Scenes, a novel dataset covering seven semantic dimensions: environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete risk/severity level. A GPT-4o-driven human-in-the-loop annotation pipeline yields 5,192 frames of fine-grained annotations spanning 350+ leaf attributes, augmented with attribute co-occurrence graphs and semantic retrieval capabilities. Using LoRA to fine-tune Qwen2-VL-2B, we perform risk-aware modeling with deterministic decoding and multi-metric evaluation (accuracy, F1, MAE, RMSE). Contribution: We open-source all annotation scripts, baseline models, and evaluation pipelines, enabling reproducible, data-driven research in intelligent driving.

📝 Abstract
CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable, scene-level understanding. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes using a 28-key category/sub-category knowledge base covering environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human-in-the-loop verification; we release the exact prompts, post-processing rules, and per-field baseline model performance. CAR-Scenes also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining across sources. To calibrate task difficulty, we include reproducible, non-benchmark baselines, notably a LoRA-tuned Qwen2-VL-2B with deterministic decoding, evaluated via scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE on a fixed validation split. We publicly release the annotation and analysis scripts, including graph construction and evaluation scripts, to enable explainable, data-centric workflows for future intelligent vehicles. Dataset: https://github.com/Croquembouche/CAR-Scenes
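The evaluation protocol described above (scalar accuracy for single-valued attributes, micro-averaged F1 for list-valued attributes, and MAE/RMSE on the 1-10 severity scale) can be sketched in plain Python. This is an illustrative reimplementation, not the paper's released evaluation script:

```python
import math

def scalar_accuracy(preds, golds):
    """Exact-match accuracy for single-valued attributes (e.g. weather)."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def micro_f1(pred_sets, gold_sets):
    """Micro-averaged F1 for list-valued attributes: pool TP/FP/FN over all frames."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_sets, gold_sets):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)
        fp += len(pred - gold)
        fn += len(gold - pred)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return 2 * precision * recall / (precision + recall) if precision + recall else 0.0

def severity_errors(preds, golds):
    """MAE and RMSE for the discrete 1-10 severity rating."""
    diffs = [p - g for p, g in zip(preds, golds)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse
```

Micro-averaging pools true/false positives across all frames before computing F1, so frames with many listed attributes weigh proportionally more than sparse ones.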
Problem

Research questions and friction points this paper is trying to address.

Creating a vision-language dataset for autonomous driving scene understanding
Providing semantic annotations for risk-aware scenario analysis
Enabling interpretable VLM training with structured attribute labeling
Innovation

Methods, ideas, or system contributions that make the work stand out.

GPT-4o-assisted vision-language pipeline with human verification
28-category knowledge base with 350+ leaf attributes
LoRA-tuned Qwen2-VL-2B model with deterministic decoding
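As a rough illustration of the attribute co-occurrence graphs built from the JSONL records, one could tally pairwise attribute counts per frame; the `attributes` key and example values below are hypothetical placeholders, not the released schema:

```python
import json
from collections import Counter
from itertools import combinations

def cooccurrence_counts(jsonl_lines):
    """Count how often each pair of leaf attributes appears in the same frame.

    Each line is assumed to hold a JSON object with an 'attributes' list;
    that key is a placeholder, not necessarily the dataset's actual schema.
    """
    counts = Counter()
    for line in jsonl_lines:
        record = json.loads(line)
        attrs = sorted(set(record["attributes"]))
        counts.update(combinations(attrs, 2))  # undirected edges as sorted pairs
    return counts

# Example: two frames sharing the 'rain' and 'intersection' attributes
frames = [
    '{"attributes": ["rain", "intersection", "pedestrian_crossing"]}',
    '{"attributes": ["rain", "intersection"]}',
]
edges = cooccurrence_counts(frames)
```

Edge weights of this kind support the dataset-triage and scenario-mining uses the abstract mentions, e.g. retrieving all frames where a risky attribute pair co-occurs.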