🤖 AI Summary
This work addresses the limitations of general-purpose large models in accurately understanding the fine-grained attributes and engineering specifications of urban road infrastructure, a gap that undermines the reliability of intelligent perception tasks. To overcome this, the authors propose a domain-adaptation framework that transforms large vision-language models into specialized agents by integrating open-vocabulary detection (Grounding DINO), parameter-efficient LoRA fine-tuning, and a dual-modality retrieval-augmented generation (RAG) mechanism. This combination enables precise object localization, compliance-aware semantic reasoning, and knowledge-guided optimization, mitigating hallucinations and enforcing adherence to professional standards. Evaluated on a newly curated urban road dataset, the approach achieves 58.9 mAP for detection and 95.5% attribute recognition accuracy.
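The LoRA component mentioned above can be illustrated with a minimal sketch. The idea is to freeze the pretrained weight matrix and train only a low-rank update, so a small fraction of parameters is tuned. The dimensions, rank, and scaling factor below are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes; real VLM layers are far larger, so the savings grow.
d_out, d_in, r, alpha = 64, 64, 8, 16

W = rng.standard_normal((d_out, d_in))      # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01   # trainable down-projection
B = np.zeros((d_out, r))                    # trainable up-projection, zero-init

def lora_forward(x):
    """y = W x + (alpha / r) * B A x : frozen base path plus low-rank adapter."""
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B zero-initialized, the adapter contributes nothing at the start of
# training, so the model initially behaves exactly like the pretrained one.
assert np.allclose(lora_forward(x), W @ x)

full_params = W.size
lora_params = A.size + B.size
print(f"trainable fraction: {lora_params / full_params:.1%}")
```

At these toy sizes the trainable fraction is 25%; at the scale of a model like Qwen-VL the same construction typically trains well under 1% of the weights.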
📝 Abstract
Automated perception of urban roadside infrastructure is crucial for smart city management, yet general-purpose models often fail to capture the necessary fine-grained attributes and domain rules. While Large Vision-Language Models (VLMs) excel at open-world recognition, they frequently misinterpret complex facility states relative to engineering standards, leading to unreliable performance in real-world applications. To address this, we propose a domain-adapted framework that transforms VLMs into specialized agents for intelligent infrastructure analysis. Our approach integrates a data-efficient fine-tuning strategy with a knowledge-grounded reasoning mechanism. Specifically, we apply open-vocabulary fine-tuning to Grounding DINO to robustly localize diverse assets with minimal supervision, followed by LoRA-based adaptation of Qwen-VL for deep semantic attribute reasoning. To mitigate hallucinations and enforce professional compliance, we introduce a dual-modality Retrieval-Augmented Generation (RAG) module that dynamically retrieves authoritative industry standards and visual exemplars during inference. Evaluated on a comprehensive new dataset of urban roadside scenes, our framework achieves 58.9 mAP for detection and 95.5% attribute recognition accuracy, demonstrating a robust solution for intelligent infrastructure monitoring.
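The dual-modality retrieval step can be sketched as two cosine-similarity searches, one over embedded standard clauses and one over embedded visual exemplars, whose top hits are then injected into the VLM prompt. The embedding dimensions, corpus sizes, and top-k values below are assumptions for illustration, not details from the paper.

```python
import numpy as np

def top_k(query, corpus_embs, k):
    """Indices of the k corpus embeddings most cosine-similar to the query."""
    q = query / np.linalg.norm(query)
    c = corpus_embs / np.linalg.norm(corpus_embs, axis=1, keepdims=True)
    sims = c @ q
    return np.argsort(-sims)[:k]

rng = np.random.default_rng(1)
dim = 32
text_embs = rng.standard_normal((100, dim))   # embedded clauses from industry standards
image_embs = rng.standard_normal((50, dim))   # embedded visual exemplars of assets
query = rng.standard_normal(dim)              # embedding of the detected facility crop

std_ids = top_k(query, text_embs, k=3)        # authoritative standard clauses
exemplar_ids = top_k(query, image_embs, k=2)  # reference images of compliant facilities

# Both retrieved sets would then be concatenated into the VLM prompt so that
# attribute reasoning is grounded in the standards rather than hallucinated.
print(std_ids, exemplar_ids)
```

In a real pipeline the random vectors would be replaced by embeddings from a text encoder and an image encoder, but the retrieval logic is the same.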