DisasterBench: Benchmarking LLM Planning under Typed Tool Interface Constraints

📅 2026-05-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a gap in LLM-based orchestration for disaster response: models struggle to generate multi-step executable workflows with correctly bound parameters and consistent dependencies. The authors propose DisasterBench, a benchmark for evaluating structured multi-agent planning over semantically similar but operationally distinct tools, and introduce First-Point-of-Failure (FPoF), which localizes the earliest root cause in a predicted workflow and so separates primary errors from downstream cascading failures. The evaluation yields three findings: the effectiveness of a planning method depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures; and verbose intermediate reasoning can clash with structured output requirements and disrupt plan generation. Together, these results point to a fundamental gap between semantic reasoning and execution-grounded coordination.

📝 Abstract

Disasters cause severe societal impacts, demanding rapid coordination of heterogeneous AI tools, from satellite analysis to flood prediction and damage assessment, into coherent multi-step workflows. As LLMs increasingly serve as orchestrators of such pipelines, effective coordination requires more than selecting semantically plausible tools: LLMs must generate executable workflows with correct parameter binding and dependency propagation. We introduce DisasterBench, a benchmark for evaluating structured multi-agent planning over semantically similar but operationally distinct disaster-response tools. To enable step-level failure attribution, we further propose First-Point-of-Failure (FPoF), which localizes the earliest root cause in a predicted workflow, separating primary errors from downstream cascading effects. Our evaluation reveals three findings: planning method effectiveness depends strongly on model capacity; tool mismatch and parameter-binding errors dominate first failures, revealing semantic grounding and execution consistency as distinct bottlenecks; and verbose intermediate reasoning can create instruction clash with structured output requirements, disrupting plan generation. Together, these findings highlight a fundamental gap between semantic reasoning and execution-grounded coordination, underscoring the need for planning frameworks that jointly model semantic intent, execution constraints, and workflow consistency. Code, data, and evaluation resources are available at: https://github.com/TamuChen18/DisasterBench_Open
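The idea behind FPoF, reporting only the earliest failing step so that later cascading errors are not counted as independent failures, can be illustrated with a minimal sketch. This is not the authors' implementation: the tool names, the `TOOL_SPECS` table, the `Ref` convention for step dependencies, the error labels, and the assumption that predicted and reference workflows are aligned step by step are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Ref:
    """Hypothetical marker: 'use the output of step number `step`'."""
    step: int


@dataclass
class Step:
    tool: str
    params: dict  # parameter name -> literal value or Ref


# Hypothetical typed tool interfaces: tool name -> {parameter name: expected type}.
TOOL_SPECS = {
    "satellite_analysis": {"region": str},
    "flood_prediction": {"imagery": Ref, "horizon_hours": int},
    "damage_assessment": {"flood_map": Ref},
}


def check_step(index: int, pred: Step, gold: Step) -> Optional[str]:
    """Return an error label for one step, or None if the step is correct."""
    if pred.tool != gold.tool:
        return "tool_mismatch"
    spec = TOOL_SPECS[pred.tool]
    if set(pred.params) != set(spec):
        return "parameter_binding"  # missing or unexpected parameter names
    for name, expected_type in spec.items():
        value = pred.params[name]
        if not isinstance(value, expected_type):
            return "parameter_binding"  # wrong type for this parameter
        if isinstance(value, Ref):
            # A dependency must point to an earlier step and match the reference.
            if value.step >= index or value != gold.params[name]:
                return "dependency"
        elif value != gold.params[name]:
            return "parameter_binding"  # right type, wrong value
    return None


def first_point_of_failure(predicted, reference) -> Optional[Tuple[int, str]]:
    """Return (step index, error label) of the earliest failure, or None."""
    for i, gold in enumerate(reference):
        if i >= len(predicted):
            return (i, "missing_step")
        error = check_step(i, predicted[i], gold)
        if error is not None:
            return (i, error)  # stop here: later errors may be cascading
    if len(predicted) > len(reference):
        return (len(reference), "extra_step")
    return None


reference = [
    Step("satellite_analysis", {"region": "Houston"}),
    Step("flood_prediction", {"imagery": Ref(0), "horizon_hours": 24}),
    Step("damage_assessment", {"flood_map": Ref(1)}),
]
predicted = [
    Step("satellite_analysis", {"region": "Houston"}),
    Step("flood_prediction", {"imagery": Ref(0), "horizon_hours": "24"}),  # wrong type
    Step("damage_assessment", {"flood_map": Ref(0)}),  # wrong dependency, but later
]

print(first_point_of_failure(predicted, reference))
```

In this toy workflow, steps 1 and 2 both contain errors, but only the earlier one is reported. A scorer that counted every failing step would instead attribute two failures to this plan, which is the kind of double counting that first-failure attribution is meant to avoid.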

Problem

Research questions and friction points this paper is trying to address.

LLM planning

tool coordination

executable workflow

parameter binding

disaster response

Innovation

Methods, ideas, or system contributions that make the work stand out.

DisasterBench

LLM planning

tool interface constraints