DisasterInsight: A Multimodal Benchmark for Function-Aware and Grounded Disaster Assessment

📅 2026-01-26
🤖 AI Summary
Existing remote-sensing vision-language benchmarks offer little support for building-function understanding and instruction robustness in disaster scenarios, limiting their utility for humanitarian response. This work proposes DisasterInsight, the first fine-grained, function-aware multimodal benchmark tailored to humanitarian workflows, which reconstructs approximately 112,000 building instances from the xBD dataset and spans diverse instruction-based tasks such as building-function classification, damage-severity assessment, disaster-type identification, counting, and structured report generation. The authors develop DI-Chat, a domain-adapted baseline obtained by LoRA fine-tuning, and conduct systematic evaluations across multiple general-purpose and remote-sensing vision-language models. Results show that DI-Chat significantly outperforms baselines in damage grading, disaster typing, and report generation, yet building-function classification remains challenging, revealing current vision-language models' limitations in fine-grained semantic understanding.

📝 Abstract
Timely interpretation of satellite imagery is critical for disaster response, yet existing vision-language benchmarks for remote sensing largely focus on coarse labels and image-level recognition, overlooking the functional understanding and instruction robustness required in real humanitarian workflows. We introduce DisasterInsight, a multimodal benchmark designed to evaluate vision-language models (VLMs) on realistic disaster analysis tasks. DisasterInsight restructures the xBD dataset into approximately 112K building-centered instances and supports instruction-diverse evaluation across multiple tasks, including building-function classification, damage-level and disaster-type classification, counting, and structured report generation aligned with humanitarian assessment guidelines. To establish domain-adapted baselines, we propose DI-Chat, obtained by fine-tuning existing VLM backbones on disaster-specific instruction data using parameter-efficient Low-Rank Adaptation (LoRA). Extensive experiments on state-of-the-art generic and remote-sensing VLMs reveal substantial performance gaps across tasks, particularly in damage understanding and structured report generation. DI-Chat achieves significant improvements on damage-level and disaster-type classification as well as report generation quality, while building-function classification remains challenging for all evaluated models. DisasterInsight provides a unified benchmark for studying grounded multimodal reasoning in disaster imagery.
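The abstract notes that DI-Chat is produced by fine-tuning existing VLM backbones with parameter-efficient Low-Rank Adaptation (LoRA). As a rough illustration of the core idea only (this is not the paper's implementation; all shapes, names, and values below are invented for the sketch), LoRA freezes a pretrained weight matrix W and learns a low-rank update B·A instead, which drastically cuts the number of trainable parameters:

```python
import numpy as np

# Minimal LoRA sketch: instead of updating a full weight matrix W
# (d_out x d_in), learn a low-rank delta B @ A of rank r << min(d_out, d_in).
# Shapes and hyperparameters here are illustrative, not from the paper.
rng = np.random.default_rng(0)
d_out, d_in, r, alpha = 64, 64, 4, 8

W = rng.standard_normal((d_out, d_in))       # frozen pretrained weight
A = rng.standard_normal((r, d_in)) * 0.01    # trainable down-projection
B = np.zeros((d_out, r))                     # trainable up-projection, zero-init

def lora_forward(x):
    # Base path plus scaled low-rank update: W x + (alpha / r) * B A x
    return W @ x + (alpha / r) * (B @ (A @ x))

x = rng.standard_normal(d_in)
# With B initialised to zero, the adapted model starts out identical
# to the frozen backbone.
assert np.allclose(lora_forward(x), W @ x)

full_params = W.size                # 4096 parameters if fine-tuned fully
lora_params = A.size + B.size       # 512 trainable parameters with r = 4
print(f"trainable: {lora_params} vs full fine-tuning: {full_params}")
```

Only A and B receive gradients during fine-tuning, which is why LoRA is attractive for adapting large VLM backbones to a domain-specific instruction set such as DisasterInsight's.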
Problem

Research questions and friction points this paper is trying to address.

disaster assessment
vision-language models
functional understanding
instruction robustness
multimodal benchmark
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal benchmark
function-aware assessment
vision-language models
Low-Rank Adaptation (LoRA)
structured report generation