Can LLM Agents Respond to Disasters? Benchmarking Heterogeneous Geospatial Reasoning in Emergency Operations

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the lack of systematic evaluation of end-to-end disaster response workflows, a gap that makes it hard to assess how well large language models (LLMs) handle complex emergency scenarios as a whole. It introduces DORA, the first agentic benchmark for end-to-end disaster response, comprising 515 expert-authored tasks across 45 real-world disaster events of 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks cover five dimensions: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multimodal report synthesis. Built on a library of 108 Model Context Protocol (MCP) tools, DORA integrates heterogeneous geospatial data (optical, SAR, and multispectral imagery at 0.015–10 m GSD, plus elevation and social vector layers) and supports single-, bi-, and multi-temporal analysis. Evaluating 13 frontier LLMs exposes three disaster-specific failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition), a dual bottleneck in tool selection and argument grounding that gold tool-order hints (1.08–4.40% gain) and alternative scaffolds (at most 3.24% gain) barely relieve, and an agent-to-gold gap that widens from 7% to 56% on long pipelines.

📝 Abstract

Operational disaster response goes beyond damage assessment, requiring responders to integrate multi-sensor signals, reason over road networks, populations, and key facilities, plan evacuations, and produce actionable reports. However, prior work largely isolates remote-sensing perception or evaluates generic tool use, leaving the end-to-end workflows of emergency operations underexplored. In this paper, we introduce the Disaster Operational Response Agent benchmark (DORA), the first agentic benchmark for end-to-end disaster response: 515 expert-authored tasks across 45 real-world disaster events spanning 10 types, paired with expert-verified, replayable gold trajectories totaling 3,500 tool-call steps. Tasks span five dimensions that cover the operational disaster-response pipeline: disaster perception, spatial relational analysis, rescue and evacuation planning, temporal evolution reasoning, and multi-modal report synthesis. Agents compose calls from a 108-tool MCP library over heterogeneous geospatial data: optical, SAR, and multi-spectral imagery across single-, bi-, and multi-temporal sequences (0.015-10 m GSD), complemented by elevation and social vector layers. We comprehensively evaluate 13 frontier LLMs on our benchmark, revealing three persistent challenges: 1) disaster-domain grounding exposes unique failure modes (damage-semantic grounding, sensor-modality mismatch, and disaster-pipeline composition); 2) agents are doubly bottlenecked by tool selection and argument grounding, where gold tool-order hints improve accuracy by only 1.08-4.40%, and alternative scaffolds yield at most a 3.24% gain; 3) compositional fragility scales with trajectory length, with the agent-to-gold gap widening from 7% to 56% on long pipelines. DORA establishes a rigorous testbed for operationally reliable disaster-response agents.
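
To make the distinction between tool selection and argument grounding concrete, the sketch below compares an agent trajectory against a gold trajectory step by step. It is only an illustration under stated assumptions: the `ToolCall` type, the `score_trajectory` function, the position-wise matching rule, and every tool name and argument are hypothetical, and none of them is taken from DORA's tool library or its actual scoring protocol.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One trajectory step: a tool name plus its keyword arguments (hypothetical type)."""
    tool: str
    args: dict = field(default_factory=dict)


def score_trajectory(agent, gold):
    """Compare agent steps with gold steps position by position.

    Assumption: steps are aligned by index; a real benchmark may align differently.
    """
    if not gold:
        raise ValueError("gold trajectory must not be empty")
    tool_hits = 0  # steps where the agent picked the gold tool
    arg_hits = 0   # steps where the tool AND all arguments match
    for agent_step, gold_step in zip(agent, gold):
        if agent_step.tool == gold_step.tool:
            tool_hits += 1
            if agent_step.args == gold_step.args:
                arg_hits += 1
    return {
        # Fraction of gold steps with the right tool.
        "tool_selection": tool_hits / len(gold),
        # Among correctly selected tools, fraction with correct arguments.
        "argument_grounding": arg_hits / tool_hits if tool_hits else 0.0,
        # Fraction of gold steps reproduced exactly.
        "exact_step_match": arg_hits / len(gold),
    }


# Hypothetical four-step flood pipeline; tool names are invented for illustration.
gold = [
    ToolCall("load_sar_image", {"event_id": "flood_01"}),
    ToolCall("segment_flood_extent", {"sensor": "sar"}),
    ToolCall("intersect_roads", {"buffer_m": 50}),
    ToolCall("plan_evacuation_route", {"mode": "drive"}),
]
agent = [
    ToolCall("load_sar_image", {"event_id": "flood_01"}),
    ToolCall("segment_flood_extent", {"sensor": "optical"}),  # right tool, wrong sensor
    ToolCall("intersect_roads", {"buffer_m": 50}),
    ToolCall("summarize_report", {}),                         # wrong tool
]

scores = score_trajectory(agent, gold)
print(scores)
```

In this toy case the agent selects the right tool in three of four steps but grounds the arguments correctly in only two of those three, so the two error sources show up as separate numbers. Because exact-step matches require both to be right at every position, longer gold trajectories give an agent more chances to diverge, which is one plausible reading of why the reported gap grows on long pipelines.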

Problem

Research questions and friction points this paper is trying to address.

disaster response

geospatial reasoning

LLM agents

emergency operations

benchmarking

Innovation

Methods, ideas, or system contributions that make the work stand out.

geospatial reasoning

LLM agents

disaster response benchmark