🤖 AI Summary
This work evaluates the practical capabilities of large language models (LLMs) on a real-world, high-stakes, long-horizon reasoning task: tornado risk forecasting. We propose AgentCaster, the first contamination-free multimodal LLM framework for this task, which jointly processes high-resolution convective forecast imagery (3,625 maps) and forecast soundings (40,125 profiles), generating probabilistic risk polygons via geometric alignment in a projected coordinate system. We introduce TornadoBench, a novel benchmark, and TornadoHallucination, a quantitative metric for hallucination in dynamic spatiotemporal reasoning and hazard prediction. Experiments span 40 days of historical tornado events and reveal pervasive over-prediction and significant geospatial localization errors; LLM performance falls markedly short of human expert accuracy. These findings expose fundamental limitations of current LLMs in reasoning about complex dynamical systems and establish a new evaluation paradigm, grounded in empirical meteorological validation, for trustworthy AI in operational forecasting.
📝 Abstract
There is a growing need to evaluate Large Language Models (LLMs) on complex, high-impact, real-world tasks to assess their true readiness as reasoning agents. To address this gap, we introduce AgentCaster, a contamination-free framework that employs multimodal LLMs end-to-end for the challenging, long-horizon task of tornado forecasting. Within AgentCaster, models interpret heterogeneous spatiotemporal data from a high-resolution convection-allowing forecast archive. We assess model performance over a 40-day period of diverse historical data spanning several major tornado outbreaks and including over 500 tornado reports. Each day, models interactively query a pool of 3,625 forecast maps and 40,125 forecast soundings for a forecast horizon of 12-36 hours. Probabilistic tornado-risk polygon predictions are then verified against ground truths derived from geometric comparisons across disjoint risk bands in projected coordinate space. To quantify accuracy, we propose the domain-specific TornadoBench and TornadoHallucination metrics; TornadoBench proves highly challenging for both LLMs and expert human forecasters. Notably, human experts significantly outperform state-of-the-art models, which show a strong tendency to hallucinate and overpredict risk intensity, struggle with precise geographic placement, and exhibit poor spatiotemporal reasoning in complex, dynamically evolving systems. AgentCaster aims to advance research on improving LLM agents for challenging reasoning tasks in critical domains.
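To make the verification step concrete, here is a minimal sketch, not the paper's actual code, of how a predicted risk polygon could be compared geometrically against a ground-truth region in projected coordinates: both polygons are rasterized onto a shared grid and scored by intersection-over-union. The polygon coordinates, grid resolution, and the IoU scoring choice are all illustrative assumptions.

```python
# Hedged sketch: verifying a predicted tornado-risk polygon against a
# ground-truth region by rasterizing both onto a grid in a projected
# coordinate system and comparing the resulting cell sets. All names,
# coordinates, and the IoU metric here are illustrative assumptions.

def point_in_polygon(x, y, poly):
    """Ray-casting test for a point against a simple polygon [(x, y), ...]."""
    inside = False
    n = len(poly)
    for i in range(n):
        x1, y1 = poly[i]
        x2, y2 = poly[(i + 1) % n]
        if (y1 > y) != (y2 > y):
            # x-coordinate where this edge crosses the horizontal ray at y
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside

def rasterize(poly, xs, ys):
    """Set of grid cells whose centers fall inside the polygon."""
    return {(i, j) for i, x in enumerate(xs) for j, y in enumerate(ys)
            if point_in_polygon(x, y, poly)}

def risk_iou(pred_poly, truth_poly, xs, ys):
    """Intersection-over-union of two risk polygons on a shared grid."""
    pred = rasterize(pred_poly, xs, ys)
    truth = rasterize(truth_poly, xs, ys)
    union = pred | truth
    return len(pred & truth) / len(union) if union else 0.0

# Toy example in projected (km-scale) coordinates: an overpredicted risk
# polygon covers the true region but extends far beyond it, so it scores low.
xs = [10 * k for k in range(50)]  # 50 x 50 grid, 10 km spacing
ys = [10 * k for k in range(50)]
truth = [(100, 100), (200, 100), (200, 200), (100, 200)]  # 100 km square
pred = [(50, 50), (400, 50), (400, 400), (50, 400)]       # much larger box
print(round(risk_iou(pred, truth, xs, ys), 3))  # → 0.082
```

The low score for the oversized prediction mirrors the over-prediction failure mode the benchmark penalizes: covering the truth region is not enough if the risk area is drawn far too large.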