Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing vision-language models lack explicit geometric and relational structure, which hinders precise reasoning about road layouts, while high-definition maps and topological road graphs are geometrically accurate but semantically rigid. This work proposes the Combined Road Substrate (CRS), a graph-grounded framework that makes road geometry and open-vocabulary semantics jointly executable in a single representation. CRS uses recursive graph queries to automatically generate compositionally complex, linguistically varied question-answer pairs, each accompanied by "grounding for free" that traces the answer to specific map elements and by procedurally extracted chain-of-thought supervision traces. State-of-the-art VLMs, including large closed-source models, struggle with structured road reasoning, whereas small models (2B or 4B parameters) trained on as few as 20 to 80 CRS-enriched scenes show stable gains on compositional reasoning tasks of varying depth. The dominant failure mode shifts from relational scene understanding to attribute recognition, suggesting that the bottleneck in road understanding is the absence of structured supervision rather than model scale.

📝 Abstract

Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs, including large, closed-source models, struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models fail mainly at attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.
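To make the idea of recursive graph queries concrete, here is a minimal sketch of how a question-answer pair, its grounding, and a step-by-step trace could all fall out of one query execution. It is not the paper's CRS implementation: the node ids, attribute names, relation names (`successor`, `controlled_by`), and the functions `query` and `make_qa` are hypothetical, chosen only for illustration.

```python
from dataclasses import dataclass, field

# Hypothetical toy road graph. Node ids, attributes, and relation names are
# illustrative assumptions, not the CRS schema from the paper.
NODES = {
    "lane_1": {"type": "lane", "direction": "straight"},
    "lane_2": {"type": "lane", "direction": "left turn"},
    "light_7": {"type": "traffic light", "state": "red"},
}
EDGES = {
    ("lane_1", "successor"): ["lane_2"],
    ("lane_2", "controlled_by"): ["light_7"],
}


@dataclass
class Answer:
    value: str = ""
    grounding: list = field(default_factory=list)  # map elements visited
    trace: list = field(default_factory=list)      # one reasoning step per hop


def query(node, relations, attribute, answer=None):
    """Recursively follow `relations` from `node`, then read `attribute`."""
    if answer is None:
        answer = Answer()
    answer.grounding.append(node)
    if not relations:  # base case: read the attribute at the final node
        answer.value = NODES[node][attribute]
        answer.trace.append(f"{node} has {attribute} = {answer.value}")
        return answer
    head, rest = relations[0], relations[1:]
    target = EDGES[(node, head)][0]  # toy simplification: first neighbour only
    answer.trace.append(f"{node} --{head}--> {target}")
    return query(target, rest, attribute, answer)


def make_qa(start, relations, attribute):
    """Package one executed query as a question, answer, grounding, and trace."""
    result = query(start, list(relations), attribute)
    question = (
        f"Starting at {start}, follow {' then '.join(relations)}: "
        f"what is the {attribute}?"
    )
    return {
        "question": question,
        "answer": result.value,
        "grounding": result.grounding,
        "trace": result.trace,
        "depth": len(relations),  # number of relational hops
    }


qa = make_qa("lane_1", ["successor", "controlled_by"], "state")
for key, value in qa.items():
    print(f"{key}: {value}")
```

In this sketch the grounding list and the trace are by-products of executing the query, which is one way to read "grounding for free"; lengthening the `relations` list gives questions of greater compositional depth.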

Problem

Research questions and friction points this paper is trying to address.

road understanding

visual reasoning

vision-language models

graph representation

structured supervision

Innovation

Methods, ideas, or system contributions that make the work stand out.

graph-based reasoning

structured road understanding

vision-language models