🤖 AI Summary
This study addresses the challenges of automated scene understanding in post-disaster aerial imagery, which are exacerbated by visual clutter, cross-event domain shifts, and scarce annotations. It presents a systematic comparison between open-vocabulary vision-language models and conventional supervised approaches, examining performance limits, failure modes, and practical trade-offs for semantic segmentation and object detection in disaster scenarios. Evaluations are conducted on multi-hazard datasets, including FloodNet+, RescueNet, DFire, and LADD, pitting large-scale pretrained vision-language foundation models against traditional supervised frameworks. The results demonstrate that, given a fixed label space and sufficient labeled data, supervised methods still significantly outperform open-vocabulary approaches, particularly on small-object detection and complex boundary segmentation.
📝 Abstract
Aerial imagery is critical for large-scale post-disaster damage assessment, yet automated interpretation remains challenging due to clutter, visual variability, and strong cross-event domain shift, while supervised approaches still rely on costly, task-specific annotations with limited coverage across disaster types and regions. Recent open-vocabulary and foundation vision models offer an appealing alternative: by leveraging large-scale pretraining and vision-language representations, they reduce dependence on fixed label sets and extensive task-specific annotations. These properties are particularly relevant for post-disaster domains, where visual concepts are ambiguous and data availability is constrained. In this work, we present a comparative evaluation of supervised learning and open-vocabulary vision models for post-disaster scene understanding, focusing on semantic segmentation and object detection across multiple datasets, including FloodNet+, RescueNet, DFire, and LADD. We examine performance trends, failure modes, and practical trade-offs between the two learning paradigms, providing insight into their applicability to real-world disaster response. The most notable finding across all evaluated benchmarks is that, when the label space is fixed and annotations are available, supervised training remains the most reliable approach, especially for small objects and fine boundary delineation in cluttered scenes.