🤖 AI Summary
This work addresses inaccurate localization, unstable recognition, and coarse-grained state understanding in digital twin construction of roadside infrastructure from sparse street-view imagery. To overcome these limitations, the authors propose SVII-3D, a unified framework that integrates open-set object detection fine-tuned with LoRA, a spatial attention-based matching network, geometry-guided optimization, and a multimodal prompt-driven vision-language model. Together, these components enable high-fidelity 3D reconstruction and fine-grained condition assessment directly from sparse image inputs. The method improves asset recognition accuracy and reduces 3D localization error to the decimeter level, offering a cost-effective, scalable digital solution for intelligent infrastructure operation and maintenance.
📝 Abstract
The automated creation of digital twins and precise asset inventories is a critical task in smart city construction and facility lifecycle management. However, using cost-effective sparse imagery remains challenging because of limited recognition robustness, inaccurate localization, and a lack of fine-grained state understanding. To address these limitations, SVII-3D, a unified framework for holistic asset digitization, is proposed. First, LoRA fine-tuned open-set detection is fused with a spatial-attention matching network to robustly associate observations across sparse views. Second, a geometry-guided refinement mechanism is introduced to resolve structural errors, achieving decimeter-level 3D localization. Third, going beyond static geometric mapping, a Vision-Language Model agent leveraging multi-modal prompting is incorporated to automatically diagnose fine-grained operational states. Experiments demonstrate that SVII-3D improves identification accuracy and reduces 3D localization error to the decimeter level. Consequently, this framework offers a scalable, cost-effective solution for high-fidelity infrastructure digitization, bridging the gap between sparse perception and automated intelligent maintenance.
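The abstract does not specify how matched observations are turned into a 3D position, so the following is only an illustrative sketch of one common building block: a least-squares intersection of viewing rays from several camera poses. It is not SVII-3D's geometry-guided refinement. The function name `triangulate_rays` and the inputs (camera centers and ray directions in a shared world frame) are assumptions made for this example.

```python
import numpy as np


def triangulate_rays(centers, directions):
    """Least-squares point closest to a set of 3D rays.

    Hypothetical helper, not taken from the paper.
    centers:    (N, 3) camera centers in a shared world frame.
    directions: (N, 3) viewing directions toward the matched asset.
    """
    centers = np.asarray(centers, dtype=float)
    directions = np.asarray(directions, dtype=float)
    # Normalize so each projector below is a proper orthogonal projector.
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)

    A = np.zeros((3, 3))
    b = np.zeros(3)
    for c, d in zip(centers, directions):
        # Projector onto the plane perpendicular to the ray direction.
        P = np.eye(3) - np.outer(d, d)
        A += P
        b += P @ c

    # lstsq tolerates the degenerate case where all rays are parallel.
    return np.linalg.lstsq(A, b, rcond=None)[0]


if __name__ == "__main__":
    # Three synthetic views of an asset placed at (5, 2, 1).
    target = np.array([5.0, 2.0, 1.0])
    cams = np.array([[0.0, 0.0, 0.0], [10.0, 0.0, 0.0], [5.0, -8.0, 0.0]])
    rays = target - cams
    print(triangulate_rays(cams, rays))
```

With only two or three sparse views, the conditioning of this solve depends heavily on the baseline between cameras and on how well the observations were associated, which is consistent with the abstract's emphasis on robust cross-view matching before localization.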