Descriptor: Distance-Annotated Traffic Perception Question Answering (DTPQA)

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) exhibit insufficient perceptual reliability in safety-critical autonomous driving scenarios—particularly at long distances (>30 m) and under complex traffic conditions. Method: This paper introduces the first distance-sensitive traffic perception evaluation framework. It proposes a novel distance-annotation mechanism and constructs the dual-source DTPQA benchmark, comprising both synthetic (driving-simulator-generated) and real-world data. Leveraging the visual question answering (VQA) paradigm, it designs fine-grained questions targeting critical object recognition and behavioral understanding, with explicit per-object camera distance annotations. Contribution/Results: The framework enables the first systematic quantification of VLM performance degradation between near-range (≤20 m) and far-range (≥30 m) settings. We publicly release both the DTPQA dataset and its synthetic generation toolkit to support reproducible, extensible distance-aware evaluation. Experiments demonstrate the framework’s effectiveness in diagnosing prevalent long-range perception failures across mainstream VLMs.

📝 Abstract
The remarkable progress of Vision-Language Models (VLMs) on a variety of tasks has raised interest in their application to automated driving. However, for these models to be trusted in such a safety-critical domain, they must first possess robust perception capabilities, i.e., they must be capable of understanding a traffic scene, which can often be highly complex, with many things happening simultaneously. Moreover, since critical objects and agents in traffic scenes are often at long distances, we require systems with not only strong perception capabilities at close distances (up to 20 meters), but also at long (30+ meters) range. Therefore, it is important to evaluate the perception capabilities of these models in isolation from other skills like reasoning or advanced world knowledge. Distance-Annotated Traffic Perception Question Answering (DTPQA) is a Visual Question Answering (VQA) benchmark designed specifically for this purpose: it can be used to evaluate the perception systems of VLMs in traffic scenarios using trivial yet crucial questions relevant to driving decisions. It consists of two parts: a synthetic benchmark (DTP-Synthetic) created using a simulator, and a real-world benchmark (DTP-Real) built on top of existing images of real traffic scenes. Additionally, DTPQA includes distance annotations, i.e., how far the object in question is from the camera. More specifically, each DTPQA sample consists of (at least): (a) an image, (b) a question, (c) the ground truth answer, and (d) the distance of the object in question, enabling analysis of how VLM performance degrades with increasing object distance. In this article, we provide the dataset itself along with the Python scripts used to create it, which can be used to generate additional data of the same kind.
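The abstract specifies that each DTPQA sample bundles (at least) an image, a question, a ground-truth answer, and the queried object's distance from the camera, and that this enables analyzing accuracy degradation with distance. A minimal sketch of how such records and a near-vs-far accuracy comparison might look — the field names and the 20 m / 30 m cutoffs follow the abstract, but this schema and helper are illustrative, not the paper's actual code or file format:

```python
from dataclasses import dataclass

@dataclass
class DTPQASample:
    # Hypothetical record layout mirroring the abstract's (a)-(d):
    image_path: str   # (a) image
    question: str     # (b) question
    answer: str       # (c) ground-truth answer
    distance_m: float # (d) distance of the queried object from the camera

def accuracy_by_range(samples, predictions, near_max=20.0, far_min=30.0):
    """Compare accuracy on near-range (<= 20 m) vs long-range (>= 30 m) samples."""
    bins = {"near": [0, 0], "far": [0, 0]}  # key -> [correct, total]
    for s, pred in zip(samples, predictions):
        if s.distance_m <= near_max:
            key = "near"
        elif s.distance_m >= far_min:
            key = "far"
        else:
            continue  # mid-range samples left out of this particular comparison
        bins[key][1] += 1
        bins[key][0] += int(pred.strip().lower() == s.answer.strip().lower())
    return {k: (c / t if t else None) for k, (c, t) in bins.items()}
```

Running a VLM over the benchmark and feeding its answers through a helper like this would yield the per-range accuracies from which the distance-degradation curves are drawn.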
Problem

Research questions and friction points this paper is trying to address.

Evaluating VLMs' perception in traffic scenes using simple driving-related questions.
Assessing VLM performance degradation with increasing object distance from camera.
Providing distance-annotated benchmarks for robust traffic perception at varied ranges.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses synthetic and real-world traffic scene benchmarks
Incorporates distance annotations for object perception analysis
Evaluates Vision-Language Models with trivial driving-relevant questions
Nikos Theodoridis
Department of Electronic and Computer Engineering, University of Limerick, Castletroy, Co. Limerick V94 T9PX, Ireland
Tim Brophy
University of Galway
Reenu Mohandas
Department of Electronic and Computer Engineering, University of Limerick, Castletroy, Co. Limerick V94 T9PX, Ireland
Ganesh Sistu
Principal Artificial Intelligence Architect, Valeo Ireland
Autonomous Driving · Machine Learning · Computer Vision · Deep Learning
Fiachra Collins
Valeo Vision Systems, Dunmore Road, Tuam, Co. Galway H54 Y276, Ireland
Anthony Scanlan
Department of Electronic and Computer Engineering, University of Limerick, Castletroy, Co. Limerick V94 T9PX, Ireland
Ciarán Eising
University of Limerick
Computer Vision