🤖 AI Summary
This study addresses the challenges of autonomous wildfire monitoring by unmanned aerial vehicles (UAVs) in real-world environments, where visual degradation, dynamic fire evolution, and scarce training data hinder performance. To overcome these limitations, the authors develop a high-fidelity, physics-based digital twin of wildfire scenarios and propose an end-to-end firefront tracking framework that integrates vision-language models (VLMs) with reinforcement learning (RL). The approach introduces a GIS-driven digital twin construction pipeline, a VLM-guided RL architecture, and a novel reward mechanism combining physical and semantic information. Built upon a CLIP-style VLM, the PPO algorithm, USGS terrain data, LANDFIRE fuel datasets, and a semi-physical fire propagation solver, the system reduces fire detection time by up to sixfold across five tasks, significantly extends target visibility duration, and demonstrates, for the first time, the effectiveness of an RL-based UAV system in kilometer-scale, physically realistic wildfire environments.
📝 Abstract
Wildfire monitoring demands autonomous systems capable of reasoning under extreme visual degradation, rapidly evolving physical dynamics, and scarce real-world training data. Existing UAV navigation approaches rely on simplified simulators and supervised perception pipelines, and lack embodied agents that interact with physically realistic fire environments. We introduce FIRE-VLM, the first end-to-end vision-language model (VLM) guided reinforcement learning (RL) framework trained entirely within a high-fidelity, physics-grounded wildfire digital twin. Built from USGS Digital Elevation Model (DEM) terrain, LANDFIRE fuel inventories, and semi-physical fire-spread solvers, this twin captures terrain-induced runs, wind-driven acceleration, smoke plume occlusion, and dynamic fuel consumption. Within this environment, a PPO agent with dual-view UAV sensing is guided by a CLIP-style VLM. Wildfire-specific semantic alignment scores, derived from a single prompt describing active fire and smoke plumes, are integrated as potential-based reward shaping signals. Our contributions are: (1) a GIS-to-simulation pipeline for constructing wildfire digital twins; (2) a VLM-guided RL agent for UAV firefront tracking; and (3) a wildfire-aware reward design that combines physical terms with VLM semantics. Across five digital-twin evaluation tasks, our VLM-guided policy reduces time-to-detection by up to sixfold, increases time-in-FOV, and is, to our knowledge, the first RL-based UAV wildfire monitoring system demonstrated in kilometer-scale, physics-grounded digital-twin fires.
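The core reward mechanism described above, a VLM alignment score used as a potential-based shaping term, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `semantic_alignment` function below is a hypothetical stand-in for a CLIP-style image-text similarity (in the actual system this would be the similarity between the UAV camera frame and a single wildfire prompt), and the discount factor and observation fields are assumed for the example.

```python
GAMMA = 0.99  # discount factor (assumed, not from the paper)

def semantic_alignment(observation: dict) -> float:
    """Hypothetical stand-in for a CLIP-style semantic alignment score
    in [0, 1]. In the described framework this would be the similarity
    between the camera frame embedding and the embedding of a prompt
    describing active fire and smoke plumes; here we fake it with the
    fraction of fire pixels in view."""
    return observation["fire_pixels"] / observation["total_pixels"]

def shaped_reward(r_physical: float, obs: dict, next_obs: dict,
                  gamma: float = GAMMA) -> float:
    """Potential-based reward shaping:
        r' = r + gamma * Phi(s') - Phi(s),
    with the alignment score serving as the potential Phi. Shaping of
    this form is known to preserve the optimal policy, so the semantic
    signal accelerates learning without changing the task optimum."""
    phi_s = semantic_alignment(obs)
    phi_next = semantic_alignment(next_obs)
    return r_physical + gamma * phi_next - phi_s

# Example: the UAV maneuvers so that more of the firefront enters view,
# so the shaping term is positive even before any physical reward fires.
obs = {"fire_pixels": 10, "total_pixels": 1000}
next_obs = {"fire_pixels": 200, "total_pixels": 1000}
print(round(shaped_reward(0.0, obs, next_obs), 3))
```

In a full PPO training loop, `shaped_reward` would simply replace the raw environment reward at each transition; the physical terms (e.g. detection and field-of-view bonuses) pass through unchanged.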