Multi-Modal Building Inspection via Perceiver IO Fusion of Satellite and Street-Level Imagery

📅 2026-05-25
🤖 AI Summary
This work addresses multi-label recognition of building attributes (roof elements and roof materials) from satellite and street-level imagery, where the number of street-level views per building varies and the target building must be isolated in each image. The authors propose a multi-modal fusion framework built on a shared DINOv2 backbone and a Perceiver IO architecture. The model jointly encodes one satellite image with an arbitrary number of street-level views as spatial patch tokens, and an RGB-M masking strategy supplies the building footprint mask as a fourth input channel, acting as a soft spatial prior. On a dataset of 32,135 buildings (61,672 segments), the fusion model outperforms the other fusion strategies evaluated and yields per-class gains for attributes visible from street level (e.g., +11.3 AP for slate, +1.3 AP for dormers). The satellite-only baseline nevertheless retains a slight advantage in macro-averaged mAP for classes that are predominantly visible from above.
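As a rough illustration of the RGB-M idea, the sketch below appends a binary footprint mask to an RGB image as a fourth channel and contrasts it with hard cropping, which discards all pixels outside the footprint. The function names, image size, and mask coordinates are assumptions made for this example, and how the DINOv2 patch embedding is adapted to four input channels is not described in the text above.

```python
import numpy as np

def rgbm_input(image, footprint_mask):
    """Append the footprint mask as a fourth channel: (H, W, 3) -> (H, W, 4).

    The RGB pixels are left untouched, so context around the building
    stays visible and the mask only hints at where the target is.
    """
    if image.shape[:2] != footprint_mask.shape:
        raise ValueError("image and mask must share height and width")
    mask = footprint_mask.astype(image.dtype)[..., None]
    return np.concatenate([image, mask], axis=-1)

def hard_crop_input(image, footprint_mask):
    """Contrast case: zero out every pixel outside the footprint."""
    return image * footprint_mask.astype(image.dtype)[..., None]

# Hypothetical 224x224 image and rectangular footprint, for illustration only.
image = np.random.default_rng(0).random((224, 224, 3), dtype=np.float32)
mask = np.zeros((224, 224), dtype=bool)
mask[64:160, 80:176] = True

rgbm = rgbm_input(image, mask)          # shape (224, 224, 4), context preserved
cropped = hard_crop_input(image, mask)  # shape (224, 224, 3), context removed
```

With the mask as an extra channel, the model can still see neighbouring structures and roof edges that a hard crop would erase, which is one plausible reading of why the summary calls it a soft spatial prior.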
📝 Abstract
We present a multi-modal classification framework that fuses satellite and street-level imagery through a Perceiver IO architecture operating on spatial patch tokens from a shared DINOv2 backbone. The design naturally handles a variable number of street-level views per building without padding or fixed-size pooling, and jointly predicts multi-label roof element and roof material classes. We construct a large-scale dataset of 32,135 buildings (61,672 segments) spanning ten countries, pairing satellite images with up to eight street-level views per segment and evaluating four masking strategies for isolating the target building. We propose an RGB-M masking strategy that appends the building footprint mask as a fourth input channel, providing a soft spatial prior that outperforms hard cropping across both modalities. The Perceiver IO fusion model improves over all other fusion strategies and yields substantial per-class gains for attributes visible from street level (e.g., +11.3 AP for slate, +1.3 AP for dormers), though the satellite-only baseline retains a slight advantage in macro-averaged mAP for classes that are predominantly visible from above. These results establish a scalable, flexible architecture for multi-modal building inspection that can accommodate heterogeneous inputs and multiple output tasks.
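The claim that a variable number of street-level views needs no padding or fixed-size pooling follows from how a Perceiver-style encoder reads its input: a fixed set of latent vectors cross-attends to however many tokens are present. The sketch below shows only that read step in NumPy, with single-head attention and random weights. All sizes and names are assumptions chosen for illustration; modality or view embeddings, latent self-attention, and the output queries that decode task predictions are omitted, and this is not the authors' implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attend(latents, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: latents (N, D) read from tokens (M, D)."""
    q = latents @ w_q
    k = tokens @ w_k
    v = tokens @ w_v
    scores = q @ k.T / np.sqrt(q.shape[-1])  # (N, M)
    return softmax(scores, axis=-1) @ v      # (N, D), independent of M

def fuse(sat_tokens, street_views, latents, weights):
    """Concatenate satellite tokens with the tokens of every available view."""
    tokens = np.concatenate([sat_tokens, *street_views], axis=0)
    return cross_attend(latents, tokens, *weights)

# Hypothetical sizes: 32-dim tokens, 8 latents, 16 patch tokens per image.
rng = np.random.default_rng(0)
D, N_LATENTS, PATCHES = 32, 8, 16
latents = rng.normal(size=(N_LATENTS, D))
weights = tuple(rng.normal(size=(D, D)) / np.sqrt(D) for _ in range(3))
sat = rng.normal(size=(PATCHES, D))

# Same model, different numbers of street-level views, no padding needed.
fused_no_views = fuse(sat, [], latents, weights)
three_views = [rng.normal(size=(PATCHES, D)) for _ in range(3)]
fused_three_views = fuse(sat, three_views, latents, weights)
```

Because the attention weights are normalised over whatever tokens exist, the fused representation keeps the shape of the latent array whether a building has zero street-level views or several.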
Problem

Research questions and friction points this paper is trying to address.

multi-modal building inspection
satellite imagery
street-level imagery
roof classification
heterogeneous input fusion
Innovation

Methods, ideas, or system contributions that make the work stand out.

Perceiver IO
multi-modal fusion
RGB-M masking
DINOv2 backbone
building inspection