🤖 AI Summary
Progress in autonomous driving is hindered by the lack of scene-level understanding datasets that support both vision-language model (VLM) training and interpretable evaluation.
Method: We introduce CAR-Scenes, a novel dataset covering seven semantic dimensions: environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor status, and discrete risk levels. We propose a GPT-4o-driven human-in-the-loop annotation paradigm, yielding 5,192 frames of fine-grained annotations across 350+ attributes, augmented with attribute co-occurrence graphs and semantic retrieval capabilities. Using LoRA to fine-tune Qwen2-VL-2B, we perform risk-aware modeling via deterministic decoding and multi-metric evaluation (accuracy, F1, MAE, RMSE).
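The multi-metric evaluation named above can be sketched as follows. This is a minimal illustration of the three metric families (exact-match accuracy for scalar attributes, micro-averaged F1 for list attributes, MAE/RMSE for the severity scale), not the released evaluation pipeline; function names are our own.

```python
import math

def scalar_accuracy(preds, golds):
    """Exact-match accuracy for single-valued attributes."""
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

def micro_f1(pred_lists, gold_lists):
    """Micro-averaged F1 for list-valued attributes:
    pool true/false positives and false negatives over all frames."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_lists, gold_lists):
        pred_set, gold_set = set(pred), set(gold)
        tp += len(pred_set & gold_set)
        fp += len(pred_set - gold_set)
        fn += len(gold_set - pred_set)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    return 2 * precision * recall / denom if denom else 0.0

def severity_errors(preds, golds):
    """MAE and RMSE for the discrete 1-10 severity scale."""
    diffs = [p - g for p, g in zip(preds, golds)]
    mae = sum(abs(d) for d in diffs) / len(diffs)
    rmse = math.sqrt(sum(d * d for d in diffs) / len(diffs))
    return mae, rmse
```

For example, predicting `["night", "rain"]` against gold `["night", "fog"]` yields one true positive, one false positive, and one false negative, hence micro-F1 of 0.5.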
Contribution: We open-source all annotation scripts, benchmark models, and evaluation pipelines, enabling reproducible, data-driven research in intelligent driving.
📝 Abstract
CAR-Scenes is a frame-level dataset for autonomous driving that enables training and evaluation of vision-language models (VLMs) for interpretable, scene-level understanding. We annotate 5,192 images drawn from Argoverse 1, Cityscapes, KITTI, and nuScenes using a 28-key category/sub-category knowledge base covering environment, road geometry, background-vehicle behavior, ego-vehicle behavior, vulnerable road users, sensor states, and a discrete severity scale (1-10), totaling 350+ leaf attributes. Labels are produced by a GPT-4o-assisted vision-language pipeline with human-in-the-loop verification; we release the exact prompts, post-processing rules, and per-field baseline model performance. CAR-Scenes also provides attribute co-occurrence graphs and JSONL records that support semantic retrieval, dataset triage, and risk-aware scenario mining across sources. To calibrate task difficulty, we include reproducible, non-benchmark baselines, notably a LoRA-tuned Qwen2-VL-2B with deterministic decoding, evaluated via scalar accuracy, micro-averaged F1 for list attributes, and severity MAE/RMSE on a fixed validation split. We publicly release the annotation and analysis scripts, including graph construction and evaluation scripts, to enable explainable, data-centric workflows for future intelligent vehicles. Dataset: https://github.com/Croquembouche/CAR-Scenes