S2SServiceBench: A Multimodal Benchmark for Last-Mile S2S Climate Services

📅 2026-02-15
🤖 AI Summary
This study addresses a critical bottleneck in subseasonal-to-seasonal (S2S) climate forecasting: the lack of reliable, actionable services at the "last mile" of operational translation. It introduces S2SServiceBench, a multimodal evaluation benchmark tailored to S2S climate services. Curated from an operational climate-service system, the benchmark spans six application domains, ten service products, and three service levels, comprising around 500 tasks and more than 1,000 evaluation items. It enables fine-grained assessment of multimodal large language models on signal interpretation, reasoning under uncertainty, and decision-making. Experimental results expose current limitations in service plot comprehension, generation of executable decision handoffs, and evidence-grounded planning for dynamic hazards, offering guidance for the development of next-generation climate service agents.

📝 Abstract
Subseasonal-to-seasonal (S2S) forecasts play an essential role in providing a decision-critical weeks-to-months planning window for climate resilience and sustainability, yet a growing bottleneck is the last-mile gap: translating scientific forecasts into trusted, actionable climate services, which requires reliable multimodal understanding and decision-facing reasoning under uncertainty. Meanwhile, multimodal large language models (MLLMs) and corresponding agentic paradigms have made rapid progress in supporting various workflows, but it remains unclear whether they can reliably generate decision-making deliverables from operational service products (e.g., actionable signal comprehension, decision-making handoff, and decision analysis and planning) under uncertainty. We introduce S2SServiceBench, a multimodal benchmark for last-mile S2S climate services curated from an operational climate-service system to evaluate this capability. S2SServiceBench covers 10 service products with 150+ expert-selected cases in total, spanning six application domains: Agriculture, Disasters, Energy, Finance, Health, and Shipping. Each case is instantiated at three service levels, yielding around 500 tasks and 1,000+ evaluation items across climate resilience and sustainability applications. Using S2SServiceBench, we benchmark state-of-the-art MLLMs and agents and analyze performance across products and service levels. The results reveal persistent challenges in S2S service plot understanding and reasoning, namely actionable signal comprehension, operationalizing uncertainty into executable handoffs, and stable, evidence-grounded analysis and planning for dynamic hazards, while offering actionable guidance for building future climate-service agents.
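As a rough illustration of how the case and task counts in the abstract relate, the sketch below expands placeholder cases into one task per service level. The class names, field names, case identifiers, and product labels are hypothetical and not taken from the benchmark; the three level names are borrowed from the abstract's wording and may not match the benchmark's actual labels.

```python
from dataclasses import dataclass

# Six application domains listed in the abstract.
DOMAINS = ["Agriculture", "Disasters", "Energy", "Finance", "Health", "Shipping"]

# Assumption: level names are taken from the abstract's phrasing.
SERVICE_LEVELS = [
    "actionable signal comprehension",
    "decision-making handoff",
    "decision analysis and planning",
]


@dataclass(frozen=True)
class Case:
    """Hypothetical record for one expert-selected case."""
    case_id: str
    domain: str
    product: str


@dataclass(frozen=True)
class Task:
    """Hypothetical record: one case instantiated at one service level."""
    case: Case
    level: str


def expand_cases(cases):
    """Instantiate every case at each of the three service levels."""
    return [Task(case, level) for case in cases for level in SERVICE_LEVELS]


# Placeholder data only: 150 synthetic cases spread over 10 made-up product labels.
cases = [
    Case(f"case-{i:03d}", DOMAINS[i % len(DOMAINS)], f"product-{i % 10}")
    for i in range(150)
]
tasks = expand_cases(cases)
print(len(tasks))  # 450
```

With exactly 150 cases the expansion gives 450 tasks; the abstract's "150+" cases would push this toward its stated figure of around 500.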
Problem

Research questions and friction points this paper is trying to address.

last-mile gap
subseasonal-to-seasonal forecasting
climate services
multimodal understanding
decision-facing reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal benchmark
subseasonal-to-seasonal forecasting
climate service agents
decision-facing reasoning
last-mile translation
Chenyue Li
Hong Kong University of Science and Technology
AI for Science, Large Language Model
Wen Deng
The Hong Kong University of Science and Technology
Zhuotao Sun
The Hong Kong University of Science and Technology
Mengxi Jin
The Hong Kong University of Science and Technology
Hanzhe Cui
The Hong Kong University of Science and Technology
Han Li
Nanjing University of Information Science and Technology
Shentong Li
Beijing Normal University
Man Kit Yu
The Hong Kong University of Science and Technology
Ming Long Lai
The Hong Kong University of Science and Technology
Yuhao Yang
University of Hong Kong
Large Language Models, Agentic Models, Foundation Models, Graph Learning
Mengqian Lu
HKUST, Otto Poon Centre for Climate Resilience and Sustainability
Atmospheric Rivers, Monsoons, Extreme Weather, Prediction, Data Science and AI
Binhang Yuan
The Hong Kong University of Science and Technology