SmokeBench: Evaluating Multimodal Large Language Models for Wildfire Smoke Detection

📅 2025-12-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of detecting and localizing early-stage wildfire smoke, which is transparent, amorphous, and visually similar to clouds. To this end, we introduce SmokeBench, a benchmark for evaluating multimodal large language models (MLLMs) on smoke understanding. It covers four tasks: smoke classification, tile-based localization, grid-based localization, and smoke detection. We conduct systematic evaluations on a curated smoke image dataset using Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our analysis finds that smoke volume, not contrast, is the dominant factor governing MLLM performance (r > 0.82). We also show empirically that current models are limited in fine-grained spatial localization, particularly for early-stage smoke. These findings expose critical limitations of current MLLMs for safety-critical wildfire monitoring and provide a standardized benchmark to advance smoke perception research.
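The correlation figure above can be illustrated with a small sketch. It assumes that r denotes Pearson's correlation coefficient, that smoke volume is approximated by the fraction of the image covered by smoke, and that each image has a scalar performance score; none of these details are confirmed by the summary, and the data values are hypothetical placeholders rather than results from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and of squared deviations from the means.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-image values (NOT taken from the paper):
# fraction of the image covered by smoke, and a per-image model score.
smoke_fraction = [0.01, 0.05, 0.10, 0.20, 0.40]
model_score = [0.10, 0.25, 0.40, 0.70, 0.90]

print(f"r = {pearson_r(smoke_fraction, model_score):.3f}")
```

The same function could be applied to a contrast measure in place of `smoke_fraction` to compare how strongly each factor tracks performance.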

📝 Abstract

Wildfire smoke is transparent, amorphous, and often visually confounded with clouds, making early-stage detection particularly challenging. In this work, we introduce a benchmark, called SmokeBench, to evaluate the ability of multimodal large language models (MLLMs) to recognize and localize wildfire smoke in images. The benchmark consists of four tasks: (1) smoke classification, (2) tile-based smoke localization, (3) grid-based smoke localization, and (4) smoke detection. We evaluate several MLLMs, including Idefics2, Qwen2.5-VL, InternVL3, Unified-IO 2, Grounding DINO, GPT-4o, and Gemini-2.5 Pro. Our results show that while some models can classify the presence of smoke when it covers a large area, all models struggle with accurate localization, especially in the early stages. Further analysis reveals that smoke volume is strongly correlated with model performance, whereas contrast plays a comparatively minor role. These findings highlight critical limitations of current MLLMs for safety-critical wildfire monitoring and underscore the need for methods that improve early-stage smoke localization.
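To make the grid-based localization task concrete, the sketch below shows one way such a task could be scored. The abstract does not specify the grid size, the answer format, or the metric, so the 3x3 grid, the `box_to_cells` and `cell_iou` helpers, and the intersection-over-union score are all assumptions made for illustration.

```python
def box_to_cells(box, img_w, img_h, rows=3, cols=3):
    """Return the set of (row, col) grid cells that a bounding box overlaps.

    box is (x0, y0, x1, y1) in pixels. The 3x3 default grid is an assumption.
    """
    x0, y0, x1, y1 = box
    cell_w = img_w / cols
    cell_h = img_h / rows
    cells = set()
    for r in range(rows):
        for c in range(cols):
            cx0, cy0 = c * cell_w, r * cell_h
            cx1, cy1 = cx0 + cell_w, cy0 + cell_h
            # Strict inequalities: a box that only touches a cell edge
            # does not count as overlapping that cell.
            if x0 < cx1 and x1 > cx0 and y0 < cy1 and y1 > cy0:
                cells.add((r, c))
    return cells


def cell_iou(pred_cells, true_cells):
    """Intersection over union of two sets of grid cells."""
    union = pred_cells | true_cells
    if not union:
        return 1.0  # both empty: the model correctly reported no smoke
    return len(pred_cells & true_cells) / len(union)


# Hypothetical example: a small smoke plume near the image centre.
truth = box_to_cells((90, 90, 110, 110), img_w=300, img_h=300)
prediction = {(1, 1)}  # cells a model might name in its answer
print(sorted(truth), cell_iou(prediction, truth))
```

A small early-stage plume occupies only one or a few cells, so a single misplaced cell changes the score substantially, which is one plausible reason localization is harder than classification in this setting.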

Problem

Research questions and friction points this paper is trying to address.

Evaluating MLLMs for wildfire smoke detection in images

Assessing how accurately MLLMs classify and localize wildfire smoke

Identifying the limitations of current models in early-stage smoke detection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces SmokeBench, a benchmark for evaluating MLLMs on wildfire smoke

Evaluates multiple models on smoke classification, localization, and detection tasks

Finds that all models struggle to localize smoke accurately, especially in its early stages
