🤖 AI Summary
This work addresses automatic floor counting from street-view facade images. It proposes GATA2Floor, a model that represents each facade as a graph whose nodes are window and door detections and whose edges carry a vertical geometric prior. Using multi-head GATv2 layers and an interpretable cross-attention mechanism, the model jointly predicts the number of floors and softly assigns facade elements to latent floor slots. It combines self-supervised visual features with vision-language scoring in a weakly supervised framework that removes the need for explicit floor-level annotations. The reported experiments indicate robust counting on irregular facades, supporting graph-attention-based relational reasoning for facade understanding while reducing reliance on labeled data.
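As a rough illustration of the facade-graph idea, the sketch below builds a k-nearest-neighbour graph over window and door boxes and attaches a vertical prior to each edge. Everything here is an assumption made for illustration and not the authors' construction: the `Detection` class, the `build_facade_graph` helper, the Gaussian form of the prior, and the parameters `k` and `sigma` are all hypothetical.

```python
from dataclasses import dataclass
from math import exp, hypot


@dataclass
class Detection:
    """Axis-aligned box in normalized image coordinates (y grows downward)."""
    label: str  # "window" or "door"
    x0: float
    y0: float
    x1: float
    y1: float

    @property
    def center(self):
        return ((self.x0 + self.x1) / 2, (self.y0 + self.y1) / 2)

    @property
    def height(self):
        return self.y1 - self.y0


def build_facade_graph(dets, k=3, sigma=0.5):
    """Return directed edges (i, j, prior) from each node to its k nearest nodes.

    Assumed prior: exp(-(dy / scale)^2), where dy is the vertical offset between
    box centers and scale is sigma times the mean box height. It equals 1 for
    elements at the same height and decays as the vertical offset grows.
    """
    edges = []
    for i, a in enumerate(dets):
        ax, ay = a.center
        # Rank all other nodes by Euclidean distance between box centers.
        others = sorted(
            (j for j in range(len(dets)) if j != i),
            key=lambda j: hypot(dets[j].center[0] - ax, dets[j].center[1] - ay),
        )
        for j in others[:k]:
            b = dets[j]
            dy = abs(b.center[1] - ay)
            scale = max(sigma * (a.height + b.height) / 2, 1e-6)
            edges.append((i, j, exp(-((dy / scale) ** 2))))
    return edges


# Toy facade: two windows on an upper row, a window and a door lower down.
dets = [
    Detection("window", 0.10, 0.10, 0.20, 0.30),
    Detection("window", 0.40, 0.10, 0.50, 0.30),
    Detection("window", 0.10, 0.50, 0.20, 0.70),
    Detection("door", 0.40, 0.60, 0.50, 0.90),
]
edges = build_facade_graph(dets, k=2)
```

In this toy example the edge between the two upper windows gets a prior of 1, while the edge from an upper window to the window below it gets a prior near 0. A graph attention layer could consume such a prior as an edge feature.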
📝 Abstract
Automated analysis of building facades from street-level imagery has great potential for urban analytics, energy assessment, and emergency planning. However, it requires reasoning over spatially arranged elements rather than over isolated detections alone. In this work, we model each facade as a graph over window/door detections with a vertical prior on edges. We then introduce GATA2Floor, a multi-head Graph Attention v2 (GATv2) based model that predicts the global floor count of a building and, via learnable cross-attention queries, softly assigns elements to latent floor slots, yielding interpretable outputs and robustness to irregular designs. To mitigate the lack of labeled datasets, we show that the proposed graph-based reasoning can be applied without annotations by using a lightweight label-free proposal mechanism based on self-supervised features and vision-language scoring. Our approach demonstrates the value of graph-attention-based relational reasoning for facade understanding.
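The soft assignment of elements to latent floor slots can be pictured with a minimal cross-attention sketch. It assumes scaled dot-product scores between slot queries and node features, with a softmax over slots for each node; the abstract does not specify these details. The helpers `soft_assign` and `slot_occupancy`, the toy vectors, and the 0.5 occupancy threshold are hypothetical, and the threshold readout stands in for illustration only, not for the model's floor-count head.

```python
from math import exp, sqrt


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [exp(x - m) for x in xs]
    total = sum(es)
    return [e / total for e in es]


def soft_assign(node_feats, slot_queries):
    """Return assign[node][slot]: a distribution over slots for each node.

    Scores are scaled dot products between each slot query and the node
    feature (an assumed form of cross-attention).
    """
    d = len(slot_queries[0])
    assign = []
    for h in node_feats:
        scores = [
            sum(q_i * h_i for q_i, h_i in zip(q, h)) / sqrt(d)
            for q in slot_queries
        ]
        assign.append(softmax(scores))
    return assign


def slot_occupancy(assign):
    """Total soft mass that all nodes place on each slot."""
    n_slots = len(assign[0])
    return [sum(row[s] for row in assign) for s in range(n_slots)]


# Toy inputs: four node embeddings (as if produced by the graph layers)
# and three slot queries (learned in a real model, hand-picked here).
node_feats = [[4.0, 0.0], [4.0, 0.0], [0.0, 4.0], [0.0, 4.0]]
slot_queries = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]

assign = soft_assign(node_feats, slot_queries)
occupancy = slot_occupancy(assign)

# Illustrative readout: count slots that attract meaningful mass.
active_slots = sum(1 for mass in occupancy if mass > 0.5)
```

Here the first two nodes concentrate on slot 0, the last two on slot 1, and the third slot stays nearly empty, so two slots are active. Each row of `assign` can be inspected directly, which is the sense in which such an assignment is interpretable.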