RESPONSE: Benchmarking the Ability of Language Models to Undertake Commonsense Reasoning in Crisis Situation

📅 2025-03-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
The commonsense reasoning capabilities of current large language models (LLMs) have not been rigorously evaluated in dynamic crisis scenarios—such as natural disasters—where temporal evolution, resource constraints, and sequential decision-making are critical. Method: We introduce RESPONSE, a temporally grounded commonsense reasoning benchmark for crisis response comprising 1,789 real-world disaster instances and 6,037 questions across three tasks: resource gap identification, immediate response decision-making, and delayed intervention planning. We evaluate stepwise reasoning across crisis phases using a hybrid protocol that combines automated metrics with expert human scoring by environmental engineers. Contribution/Results: Experiments reveal that even GPT-4 achieves only 37% human-evaluated correctness on immediate response actions, exposing clear limitations in crisis-aware commonsense reasoning. RESPONSE provides a reproducible benchmark and evaluation framework for advancing robust, trustworthy AI for emergency response.

📝 Abstract
An interesting class of commonsense reasoning problems arises when people are faced with natural disasters. To investigate this topic, we present RESPONSE, a human-curated dataset containing 1789 annotated instances featuring 6037 sets of questions designed to assess LLMs' commonsense reasoning in disaster situations across different time frames. The dataset includes problem descriptions, missing resources, time-sensitive solutions, and their justifications, with a subset validated by environmental engineers. Through both automatic metrics and human evaluation, we compare LLM-generated recommendations against human responses. Our findings show that even state-of-the-art models like GPT-4 achieve only 37% human-evaluated correctness for immediate response actions, highlighting significant room for improvement in LLMs' ability for commonsense reasoning in crises.
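
The released data format is not given in this entry. As a minimal sketch of how an instance with these fields might be represented—field names, task labels, and the example values below are assumptions inferred from the abstract and summary, not the authors' schema—consider:

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical sketch of a RESPONSE-style instance; field names and task labels
# are assumptions, not the released dataset schema.
@dataclass
class CrisisQuestion:
    task: str            # e.g. "resource_gap", "immediate_response", "delayed_intervention"
    question: str        # question posed about the disaster scenario
    human_answer: str    # human-written reference solution
    justification: str   # human rationale for the reference solution

@dataclass
class CrisisInstance:
    instance_id: str
    problem_description: str         # narrative of the disaster situation
    missing_resources: List[str]     # resources identified as unavailable
    questions: List[CrisisQuestion] = field(default_factory=list)
    expert_validated: bool = False   # True for the subset checked by environmental engineers

# Illustrative (invented) example instance.
example = CrisisInstance(
    instance_id="flood-0001",
    problem_description="A river flood has cut road access to a small town; power is intermittent.",
    missing_resources=["potable water", "diesel generators"],
    questions=[
        CrisisQuestion(
            task="immediate_response",
            question="What should local responders do in the first 24 hours?",
            human_answer="Distribute bottled water from the stockpile and set up a charging point at the school.",
            justification="Safe drinking water is the most time-critical gap before supply routes reopen.",
        )
    ],
    expert_validated=True,
)
```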
Problem

Research questions and friction points this paper is trying to address.

Assessing LLMs' commonsense reasoning in disaster scenarios.
Evaluating LLM-generated crisis response recommendations against human responses.
Highlighting gaps in LLMs' ability to provide accurate immediate crisis solutions.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Human-curated dataset for crisis reasoning
Automatic and human evaluation metrics
Comparison of LLM and human responses (see the metric sketch after this list)
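
As a hedged illustration of the automatic side of such an evaluation, the snippet below scores an LLM recommendation against a human reference with a simple token-overlap F1. The paper's actual metric suite is not specified in this entry, so the metric choice, function name, and example strings are assumptions for illustration only.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-overlap F1 between a model recommendation and a human reference.

    A stand-in for the automatic metrics mentioned above; the paper's actual
    metrics are not specified in this entry.
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        return 0.0
    # Count tokens shared between prediction and reference (with multiplicity).
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Hypothetical usage: compare an LLM's immediate-response suggestion to the human answer.
llm_recommendation = "Hand out bottled water and open a generator-powered charging station at the school."
human_reference = "Distribute bottled water from the stockpile and set up a charging point at the school."
print(f"token F1 = {token_f1(llm_recommendation, human_reference):.2f}")
```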