🤖 AI Summary
Existing question-answering benchmarks struggle to evaluate models' ability to handle uncertain and conflicting information in disaster management scenarios. This work proposes DisastQA, a large-scale QA benchmark tailored for disaster response, encompassing eight disaster types and 3,000 rigorously validated questions. A human-LLM collaboration pipeline with stratified sampling ensures balanced coverage, while varied evidence conditions, from closed-book to noisy evidence integration, disentangle models' reliance on intrinsic knowledge from robust reasoning under noise. The study further introduces a keypoint-guided evaluation protocol for open-ended QA. Experiments across 20 models reveal performance rankings markedly different from those on general-purpose benchmarks like MMLU-Pro; notably, open-weight models perform comparably to proprietary counterparts on clean evidence but suffer significant degradation under noisy conditions, exposing their limited reliability in emergency contexts.
📝 Abstract
Accurate question answering (QA) in disaster management requires reasoning over uncertain and conflicting information, a setting poorly captured by existing benchmarks built on clean evidence. We introduce DisastQA, a large-scale benchmark of 3,000 rigorously verified questions (2,000 multiple-choice and 1,000 open-ended) spanning eight disaster types. The benchmark is constructed via a human-LLM collaboration pipeline with stratified sampling to ensure balanced coverage. Models are evaluated under varying evidence conditions, from closed-book to noisy evidence integration, enabling separation of internal knowledge from reasoning under imperfect information. For open-ended QA, we propose a human-verified keypoint-based evaluation protocol emphasizing factual completeness over verbosity. Experiments with 20 models reveal substantial divergences from general-purpose leaderboards such as MMLU-Pro. While recent open-weight models approach proprietary systems in clean settings, performance degrades sharply under realistic noise, exposing critical reliability gaps for disaster response. All code, data, and evaluation resources are available at https://github.com/TamuChen18/DisastQA_open.
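The keypoint-based protocol scores an open-ended answer by how many human-verified keypoints it covers, rather than by answer length. The paper's actual judging procedure is not reproduced here; the sketch below is a hypothetical minimal version that uses naive substring matching as a stand-in for the semantic matching a real judge (human or LLM) would perform.

```python
def keypoint_score(answer: str, keypoints: list[str]) -> float:
    """Fraction of required keypoints mentioned in the answer.

    Hypothetical illustration: substring matching approximates the
    semantic keypoint-coverage check described in the benchmark.
    Rewarding coverage of a fixed keypoint list means a short,
    complete answer scores higher than a verbose, incomplete one.
    """
    if not keypoints:
        return 0.0
    answer_lower = answer.lower()
    hits = sum(1 for kp in keypoints if kp.lower() in answer_lower)
    return hits / len(keypoints)

# Example (invented data, not from the benchmark):
answer = "Evacuate coastal zones before landfall; shelters open inland."
keypoints = ["evacuate coastal zones", "shelters open inland"]
print(keypoint_score(answer, keypoints))  # → 1.0
```

Because the score is normalized by the keypoint list rather than the answer, padding an answer with extra text cannot raise it, which matches the stated emphasis on factual completeness over verbosity.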