🤖 AI Summary
This work addresses the challenge of detecting and localizing early-stage wildfire smoke, which is transparent, amorphous, and visually similar to clouds. To this end, we introduce SmokeBench, the first dedicated multimodal large language model (MLLM) benchmark for smoke understanding, covering four tasks: classification, tile-based localization, grid-based localization, and end-to-end detection. We conduct systematic evaluations on a curated smoke image dataset using state-of-the-art models including Idefics2, Qwen2.5-VL, InternVL3, GPT-4o, and Grounding DINO. Our analysis reveals, for the first time, that smoke volume, rather than contrast, is the dominant factor governing MLLM performance (r > 0.82). Moreover, we empirically demonstrate fundamental limitations of current models in fine-grained spatial localization. These findings establish a novel evaluation paradigm for safety-critical remote sensing interpretation and provide both theoretical insights and a standardized benchmark to advance smoke perception research.
📝 Abstract
Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark called SmokeBench to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially when the smoke is at an early stage. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast plays a comparatively minor role. These findings highlight critical limitations of current MLLMs for safety-critical wildfire monitoring and underscore the need for methods that improve early-stage smoke localization.
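To make the grid-based localization task more concrete, the following is a minimal sketch of how such a task could be scored. It is not the SmokeBench implementation: the abstract does not specify the grid size, the answer format, or the metric, so the 3x3 grid, the `(row, col)` cell labels, and the helper names `box_to_cells` and `cell_scores` are all assumptions made for illustration. The idea is to convert a ground-truth smoke bounding box into the set of grid cells it overlaps, then compare that set with the cells a model names.

```python
from typing import Set, Tuple

Cell = Tuple[int, int]  # (row, col), zero-indexed; a hypothetical labeling scheme


def box_to_cells(box, img_w, img_h, rows=3, cols=3) -> Set[Cell]:
    """Return the grid cells overlapped by a pixel box (x_min, y_min, x_max, y_max).

    The grid size is an assumption; the benchmark's real grid may differ.
    """
    x_min, y_min, x_max, y_max = box
    cell_w = img_w / cols
    cell_h = img_h / rows
    eps = 1e-9  # keeps a box that ends exactly on a boundary out of the next cell

    def clamp(index, upper):
        return max(0, min(upper, int(index)))

    c0 = clamp(x_min // cell_w, cols - 1)
    c1 = clamp((x_max - eps) // cell_w, cols - 1)
    r0 = clamp(y_min // cell_h, rows - 1)
    r1 = clamp((y_max - eps) // cell_h, rows - 1)
    return {(r, c) for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)}


def cell_scores(pred: Set[Cell], truth: Set[Cell]):
    """Precision, recall, and IoU over sets of grid cells (illustrative metrics)."""
    tp = len(pred & truth)
    union = pred | truth
    precision = tp / len(pred) if pred else 0.0
    recall = tp / len(truth) if truth else 0.0
    iou = tp / len(union) if union else 1.0  # both empty: treated as a perfect match
    return precision, recall, iou


if __name__ == "__main__":
    # Toy values only; these are not drawn from the benchmark.
    truth = box_to_cells((50, 50, 150, 150), img_w=300, img_h=300)
    pred = {(0, 0), (1, 1), (2, 2)}  # cells a model might name
    print(sorted(truth), cell_scores(pred, truth))
```

Under this scheme a small, early-stage plume occupies only one or two cells, so a single misplaced cell sharply lowers the score, which is in line with the localization difficulty the abstract reports.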