🤖 AI Summary
Existing traffic sign datasets (e.g., Mapillary) provide only coarse-grained labels, insufficient for autonomous driving systems requiring fine-grained recognition of semantically critical classes such as “stop” and “speed limit.”
Method: We introduce MVV, the first high-accuracy, fine-grained validation set tailored to Mapillary, covering semantically unambiguous traffic sign subclasses and annotated with pixel-level instance masks. Leveraging expert annotation and instance segmentation, we systematically evaluate DINOv2 and multiple vision-language models (VLMs) on dense semantic matching.
Contribution/Results: Experiments show DINOv2 substantially outperforms mainstream VLMs (e.g., CLIP, BLIP-2), achieving mAP gains of 12.3–18.7% on traffic signs, vehicles, and pedestrians, establishing it as a new perception baseline for autonomous driving. This work exposes key limitations of current VLMs in fine-grained traffic scene understanding and provides a reproducible benchmark for future research.
📝 Abstract
Obtaining high-quality fine-grained annotations for traffic signs is critical for accurate and safe decision-making in autonomous driving. Widely used datasets, such as Mapillary, often provide only coarse-grained labels, without distinguishing semantically important types such as stop signs or speed limit signs. To this end, we present Mapillary Vistas Validation for Traffic Signs (MVV), a new validation set for traffic signs derived from the Mapillary dataset, in which we decompose composite traffic signs into granular, semantically meaningful categories. The dataset includes pixel-level instance masks and has been manually annotated by expert annotators to ensure label fidelity. Further, we benchmark several state-of-the-art VLMs against the self-supervised DINOv2 model on this dataset and show that DINOv2 consistently outperforms all VLM baselines, not only on traffic sign recognition but also on heavily represented categories such as vehicles and humans. Our analysis reveals significant limitations of current vision-language models in fine-grained visual understanding and establishes DINOv2 as a strong baseline for dense semantic matching in autonomous driving scenarios. This dataset and evaluation framework pave the way for more reliable, interpretable, and scalable perception systems.
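The dense semantic matching setup used for this kind of benchmark can be sketched as nearest-prototype classification over patch-level features. The snippet below is a minimal illustration, not the paper's exact protocol: the one-hot prototypes and hand-written query vectors are synthetic stand-ins for DINOv2 patch embeddings and per-class averaged support features.

```python
import numpy as np

def cosine_match(patch_feats, prototypes):
    """Assign each patch feature to the nearest class prototype by cosine similarity.

    patch_feats: (N, D) array of per-patch embeddings (e.g., DINOv2 patch tokens).
    prototypes:  (C, D) array, one reference embedding per fine-grained class
                 (e.g., "stop", "speed limit", ...).
    Returns an (N,) array of predicted class indices.
    """
    p = patch_feats / np.linalg.norm(patch_feats, axis=1, keepdims=True)
    q = prototypes / np.linalg.norm(prototypes, axis=1, keepdims=True)
    return (p @ q.T).argmax(axis=1)

# Synthetic demo: 3 idealized class prototypes in an 8-d feature space.
prototypes = np.eye(3, 8)
queries = np.array([
    [0.9, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # dominated by class 0
    [0.0, 0.2, 1.0, 0.0, 0.0, 0.0, 0.0, 0.0],  # dominated by class 2
    [0.1, 0.8, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0],  # dominated by class 1
])
print(cosine_match(queries, prototypes).tolist())  # [0, 2, 1]
```

In a real evaluation the prototypes would be averaged embeddings of labeled instances per fine-grained class, and the queries would be masked patch features from a frozen backbone (DINOv2 or a VLM image encoder), so the same matcher compares backbones fairly.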
Code and data are available at: https://github.com/nec-labs-ma/relabeling