🤖 AI Summary
Emergency decision-making for ports under tropical cyclone threats remains challenging due to the complexity of integrating heterogeneous, time-sensitive multimodal data. Method: We introduce CyPortQA—the first multimodal benchmark tailored to typhoon scenarios—covering 145 principal U.S. ports, 90 named storms, and nearly 3,000 real-world disruption scenarios. It integrates probabilistic wind field maps, track cones, official advisories, and port operational status, and employs an automated pipeline to generate over 110,000 structured question-answer pairs. Contribution/Results: A comprehensive evaluation of leading open-source and proprietary multimodal large language models (MLLMs) reveals competent performance in basic situational understanding but significant deficiencies in impact quantification and actionable decision reasoning. CyPortQA fills a critical gap in multimodal evaluation for extreme-weather response in critical infrastructure, establishing a rigorous benchmark and identifying concrete directions for improving MLLM reliability in high-stakes, real-world applications.
📝 Abstract
As tropical cyclones intensify and track forecasts become increasingly uncertain, U.S. ports face heightened supply-chain risk under extreme weather conditions. Port operators need to rapidly synthesize diverse multimodal forecast products, such as probabilistic wind maps, track cones, and official advisories, into clear, actionable guidance as cyclones approach. Multimodal large language models (MLLMs) offer a powerful means to integrate these heterogeneous data sources alongside broader contextual knowledge, yet their accuracy and reliability in the specific context of port cyclone preparedness have not been rigorously evaluated. To fill this gap, we introduce CyPortQA, the first multimodal benchmark tailored to port operations under cyclone threat. CyPortQA assembles 2,917 real-world disruption scenarios from 2015 through 2023, spanning 145 U.S. principal ports and 90 named storms. Each scenario fuses multisource data (i.e., tropical cyclone products, port operational impact records, and port condition bulletins) and is expanded through an automated pipeline into 117,178 structured question-answer pairs. Using this benchmark, we conduct extensive experiments on diverse MLLMs, including both open-source and proprietary models. MLLMs demonstrate great potential in situational understanding but still face considerable challenges in reasoning tasks, including potential impact estimation and decision reasoning.