RoadSceneBench: A Lightweight Benchmark for Mid-Level Road Scene Understanding

📅 2025-11-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing road-scene benchmarks predominantly evaluate low-level perception tasks (e.g., detection, segmentation) and rarely assess mid-level semantic reasoning, specifically the inference of road topology and dynamic scene structure. To bridge this gap between perception and planning, we propose RoadSceneBench, a lightweight yet information-rich benchmark for mid-level road understanding. We also introduce Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals promote spatial coherence and semantic alignment throughout the reasoning process, moving models toward geometry-aware and temporally consistent inference. In experiments across diverse road configurations, our approach achieves state-of-the-art performance, supporting the value of explicit mid-level semantic modeling for structured scene understanding.

📝 Abstract
Understanding mid-level road semantics, which capture the structural and contextual cues that link low-level perception to high-level planning, is essential for reliable autonomous driving and digital map construction. However, existing benchmarks primarily target perception tasks such as detection or segmentation, overlooking the reasoning capabilities required to infer road topology and dynamic scene structure. To address this gap, we present RoadSceneBench, a lightweight yet information-rich benchmark designed to evaluate and advance visual reasoning in complex road environments. Unlike large-scale perception datasets, RoadSceneBench emphasizes relational understanding and structural consistency, encouraging models to capture the underlying logic of real-world road scenes. Furthermore, to enhance reasoning reliability, we propose Hierarchical Relational Reward Propagation with Temporal Consistency (HRRP-T), a training framework for Vision-Language Models (VLMs) in which reward signals adaptively promote spatial coherence and semantic alignment throughout the reasoning process. This paradigm enables models to move beyond static recognition toward geometry-aware and temporally consistent reasoning. Extensive experiments demonstrate that our method achieves state-of-the-art performance across diverse road configurations. RoadSceneBench thus provides a compact yet powerful foundation for studying mid-level road semantics and fostering structure-aware autonomous perception. Our dataset is available at https://github.com/XiyanLiu/RoadSceneBench.

Problem

Research questions and friction points this paper is trying to address.

Evaluates visual reasoning in complex road environments
Addresses lack of benchmarks for road topology and scene structure
Enhances spatial coherence and semantic alignment in reasoning

Innovation

Methods, ideas, or system contributions that make the work stand out.

RoadSceneBench benchmark for mid-level road semantics
HRRP-T framework for hierarchical relational reward propagation
Promotes spatial coherence and temporal consistency in reasoning
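
To make the idea of layered reward signals more concrete, the following is a minimal sketch and not the paper's formulation, which this summary does not specify. It assumes a hypothetical per-frame prediction with two attributes (`lane_count`, `ego_lane`) and combines three illustrative terms: per-frame accuracy, a spatial-coherence check, and a temporal-consistency check over adjacent frames. All names, weights, and the specific consistency rules are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class FramePrediction:
    """Hypothetical per-frame model output; field names are illustrative."""
    lane_count: int  # predicted number of lanes in the frame
    ego_lane: int    # predicted 1-based index of the ego vehicle's lane


def accuracy_reward(pred: FramePrediction, truth: FramePrediction) -> float:
    """Fraction of the two attributes that match the ground truth."""
    hits = int(pred.lane_count == truth.lane_count) + int(pred.ego_lane == truth.ego_lane)
    return hits / 2.0


def spatial_reward(pred: FramePrediction) -> float:
    """1.0 if the ego lane index is possible given the lane count, else 0.0."""
    return 1.0 if 1 <= pred.ego_lane <= pred.lane_count else 0.0


def temporal_reward(preds: Sequence[FramePrediction]) -> float:
    """Fraction of adjacent frame pairs with an unchanged lane count.

    Assumption: a crude smoothness prior; real roads do change lane count.
    """
    if len(preds) < 2:
        return 1.0
    stable = sum(a.lane_count == b.lane_count for a, b in zip(preds, preds[1:]))
    return stable / (len(preds) - 1)


def sequence_reward(
    preds: Sequence[FramePrediction],
    truths: Sequence[FramePrediction],
    w_acc: float = 0.5,       # assumed weight, not from the paper
    w_spatial: float = 0.25,  # assumed weight, not from the paper
    w_temporal: float = 0.25, # assumed weight, not from the paper
) -> float:
    """Weighted sum of the frame-level, spatial, and temporal terms."""
    if not preds or len(preds) != len(truths):
        raise ValueError("preds and truths must be non-empty and equally long")
    acc = sum(accuracy_reward(p, t) for p, t in zip(preds, truths)) / len(preds)
    spatial = sum(spatial_reward(p) for p in preds) / len(preds)
    return w_acc * acc + w_spatial * spatial + w_temporal * temporal_reward(preds)
```

In this sketch, a sequence that is correct in every frame scores 1.0, while a prediction that places the ego vehicle in a lane that does not exist, or that flips the lane count between adjacent frames, loses reward even if some individual attributes are right. This illustrates how a reward can favor structurally and temporally consistent answers over isolated per-frame hits.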
Xiyan Liu, Baidu Inc. (Computer Vision)
Han Wang, Baidu Inc.
Yuhu Wang, Baidu Inc.
Junjie Cai, Baidu Inc.
Zhe Cao, Baidu Inc.
Jianzhong Yang, Baidu Inc.
Zhen Lu, Baidu Inc.