🤖 AI Summary
Existing traffic sign datasets (e.g., Mapillary) provide only coarse-grained labels, insufficient for autonomous driving systems requiring fine-grained recognition of semantically critical classes such as “stop” and “speed limit.”
Method: We introduce MVV, the first high-accuracy, fine-grained validation set tailored to Mapillary, covering semantically unambiguous traffic sign subclasses and annotated with pixel-level instance masks. Leveraging expert annotation and instance segmentation, we systematically evaluate DINOv2 and multiple vision-language models (VLMs) on dense semantic matching.
Contribution/Results: Experiments show DINOv2 substantially outperforms mainstream VLMs (e.g., CLIP, BLIP-2), achieving mAP gains of 12.3–18.7% on traffic signs, vehicles, and pedestrians, establishing it as a new perception baseline for autonomous driving. This work exposes key limitations of current VLMs in fine-grained traffic scene understanding and provides a reproducible benchmark for future research.
📝 Abstract
Obtaining high-quality fine-grained annotations for traffic signs is critical for accurate and safe decision-making in autonomous driving. Widely used datasets, such as Mapillary, often provide only coarse-grained labels, without distinguishing semantically important types such as stop signs or speed limit signs. To this end, we present Mapillary Vistas Validation for Traffic Signs (MVV), a new validation set for traffic signs derived from the Mapillary dataset, in which we decompose composite traffic signs into granular, semantically meaningful categories. The dataset includes pixel-level instance masks and has been manually annotated by expert annotators to ensure label fidelity. Further, we benchmark several state-of-the-art VLMs against the self-supervised DINOv2 model on this dataset and show that DINOv2 consistently outperforms all VLM baselines, not only on traffic sign recognition but also on heavily represented categories such as vehicles and humans. Our analysis reveals significant limitations of current vision-language models in fine-grained visual understanding and establishes DINOv2 as a strong baseline for dense semantic matching in autonomous driving scenarios. This dataset and evaluation framework pave the way for more reliable, interpretable, and scalable perception systems.
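The dense semantic matching setup used for this kind of benchmark can be sketched as nearest-prototype classification over patch-level features. The snippet below is a minimal illustration, not the paper's exact protocol: the one-hot prototypes and hand-written query vectors are synthetic stand-ins for DINOv2 patch embeddings and per-class averaged support features.

```python
import numpy as np

def cosine_match(patch_feats, prototypes):
    """Assign each patch feature to the nearest class prototype by cosine similarity.

    patch_feats: (N, D) array of per-patch embeddings (e.g., DINOv2 patch tokens).
    prototypes:  (C, D) array, one reference embedding per fine-grained class
                 (e.g., "stop", "speed limit", ...).
    Returns an (N,) array of predicted class indices.
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    q = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (p @ q.T).argmax(axis=1)

# Synthetic demo: 3 idealized class prototypes in an 8-d feature space.
prototypes = np.eye(3, 8)
queries = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # dominated by class 0
    [0.0, 0.2, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # dominated by class 2
    [0.1, 0.8, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],  # dominated by class 1
])
print(cosine_match(queries, prototypes).tolist())  # [0, 2, 1]
```

In a real evaluation the prototypes would be averaged embeddings of labeled instances per fine-grained class, and the queries would be masked patch features from a frozen backbone (DINOv2 or a VLM image encoder), so the same matcher compares backbones fairly.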
Code and data are available at: https://github.com/nec-labs-ma/relabeling