Unleashing the Capabilities of Large Vision-Language Models for Intelligent Perception of Roadside Infrastructure

📅 2026-01-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of general-purpose large models in accurately understanding the fine-grained attributes and engineering specifications of urban road infrastructure, limitations which undermine the reliability of intelligent perception tasks. To overcome this, the authors propose a domain adaptation framework that transforms large vision-language models into specialized agents by integrating open-vocabulary detection (Grounding DINO), LoRA-based efficient fine-tuning, and a dual-modality retrieval-augmented generation (RAG) mechanism. This synergy enables precise object localization, compliance-aware semantic reasoning, and knowledge-guided optimization, effectively mitigating hallucinations and ensuring adherence to professional standards. Evaluated on a newly curated urban road dataset, the approach achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%.
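The LoRA-based efficient fine-tuning mentioned in the summary replaces full weight updates with a trainable low-rank residual on top of a frozen pretrained weight. Below is a minimal numpy sketch of that update rule (the dimensions, rank, and scaling are illustrative; the paper applies LoRA to Qwen-VL, which is not reproduced here):

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, r, alpha = 8, 6, 2, 16    # r << min(d_in, d_out) keeps the update cheap

W0 = rng.standard_normal((d_out, d_in))    # frozen pretrained weight
A = 0.01 * rng.standard_normal((r, d_in))  # trainable down-projection
B = np.zeros((d_out, r))                   # trainable up-projection, zero init

def lora_forward(x, W0, A, B, alpha, r):
    # Base path plus scaled low-rank residual; only A and B receive gradients.
    return W0 @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B = 0 at initialization, the adapted layer reproduces the frozen layer exactly.
assert np.allclose(lora_forward(x, W0, A, B, alpha, r), W0 @ x)
```

The zero initialization of `B` is the standard LoRA trick: training starts from the pretrained model's behavior and only gradually departs from it as `A` and `B` are updated.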

📝 Abstract
Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often fail to capture the necessary fine-grained attributes and domain rules. While Large Vision-Language Models (VLMs) excel at open-world recognition, they struggle to accurately interpret complex facility states in compliance with engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we leverage open-vocabulary fine-tuning on Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation on Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves a detection performance of 58.9 mAP and an attribute recognition accuracy of 95.5%, demonstrating a robust solution for intelligent infrastructure monitoring.
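The dual-modality RAG module described in the abstract retrieves both textual standards and visual exemplars at inference time. A toy sketch of such fused retrieval, assuming precomputed text and image embeddings, is shown below (the fusion weight `w_text`, the corpus layout, and the document IDs are hypothetical, not from the paper):

```python
import numpy as np

def cosine(a, b):
    # Cosine similarity with a small epsilon to guard against zero vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

def retrieve(text_q, img_q, corpus, k=2, w_text=0.5):
    """Rank corpus entries by a weighted sum of text and image similarity."""
    scored = sorted(
        ((w_text * cosine(text_q, t) + (1 - w_text) * cosine(img_q, i), doc_id)
         for doc_id, t, i in corpus),
        reverse=True,
    )
    return [doc_id for _, doc_id in scored[:k]]

# Toy corpus: (doc_id, embedding of the standard's text, embedding of an exemplar image).
corpus = [
    ("guardrail_height_spec", np.array([1.0, 0.0, 0.0]), np.array([1.0, 0.0, 0.0])),
    ("signpost_reflectivity_spec", np.array([0.0, 1.0, 0.0]), np.array([0.0, 1.0, 0.0])),
]

# A guardrail-related query should surface the guardrail clause first.
print(retrieve(np.array([0.9, 0.1, 0.0]), np.array([0.8, 0.2, 0.0]), corpus, k=1))
# → ['guardrail_height_spec']
```

Retrieved standard text and exemplar images would then be injected into the VLM's prompt context, grounding the model's attribute judgments in authoritative references rather than parametric memory alone.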
Problem

Research questions and friction points this paper is trying to address.

Large Vision-Language Models
roadside infrastructure
intelligent perception
engineering standards
attribute recognition
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large Vision-Language Models
Domain Adaptation
Retrieval-Augmented Generation
Open-Vocabulary Fine-Tuning
LoRA
Luxuan Fu
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan 430079, China
Chong Liu
Wuhan University
3D Computer Vision · Laser Scanning · Point Cloud Compression
Bisheng Yang
State Key Laboratory of Information Engineering in Surveying, Mapping and Remote Sensing (LIESMARS), Wuhan University, Wuhan 430079, China
Zhen Dong
Wuhan University
3D Computer Vision · Intelligent Transportation System · Urban Sustainable Development