ChatENV: An Interactive Vision-Language Model for Sensor-Guided Environmental Monitoring and Scenario Simulation

📅 2025-08-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) for environmental monitoring neglect sensor-based causal signals, suffer from stylistic biases inherent in single-source captioning, and lack interactive spatiotemporal reasoning capabilities. To address these limitations, we propose the first VLM framework for interactive spatiotemporal reasoning, jointly modeling multi-temporal satellite imagery with multimodal environmental sensor data, including temperature, PM₁₀, and CO measurements. We construct a large-scale spatiotemporal paired dataset and employ GPT-4o and Gemini 2.0 to generate diverse, style-robust annotations, mitigating caption-style bias. Our model builds on the Qwen-2.5-VL architecture and is fine-tuned efficiently via LoRA. Experiments demonstrate state-of-the-art performance: a BERT-F1 score of 0.903 on temporal reasoning and counterfactual ("what-if") analysis tasks, matching or surpassing advanced time-series models, while significantly enhancing interpretability and interactive capability for understanding environmental change.

📝 Abstract
Understanding environmental changes from aerial imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision-language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT-4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and "what-if" reasoning (e.g., BERT-F1 0.903) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.
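The dataset organization described in the abstract (temporal image pairs enriched with sensor metadata) might be represented as below. This is only an illustrative sketch: the type names, field names, and example values are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass

@dataclass
class SensorReading:
    # The three sensor channels named in the abstract.
    temperature_c: float   # ambient temperature, degrees Celsius
    pm10_ugm3: float       # PM10 concentration, micrograms per cubic meter
    co_ppm: float          # carbon monoxide, parts per million

@dataclass
class TemporalPair:
    # One of the ~152k temporal pairs: two satellite images of the
    # same location at different times, plus sensor context for each.
    image_before: str              # path to earlier acquisition
    image_after: str               # path to later acquisition
    land_use_class: str            # one of 62 land-use classes
    country: str                   # one of 197 countries
    sensors_before: SensorReading
    sensors_after: SensorReading

# Example pair (values are made up for illustration).
pair = TemporalPair(
    image_before="t0.tif",
    image_after="t1.tif",
    land_use_class="urban",
    country="UAE",
    sensors_before=SensorReading(24.1, 38.0, 0.4),
    sensors_after=SensorReading(27.6, 55.0, 0.6),
)
# Sensor deltas give the model a causal signal alongside the image pair.
assert pair.sensors_after.pm10_ugm3 > pair.sensors_before.pm10_ugm3
```

Pairing each image with its own sensor reading, rather than one reading per pair, lets a model reason about the change in conditions between the two acquisitions.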
Problem

Research questions and friction points this paper is trying to address.

Enhances environmental monitoring with sensor-augmented vision-language models
Addresses bias in single-source captions via multi-style annotations
Enables interactive scenario-based reasoning for climate resilience
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates satellite images with sensor data
Uses GPT-4o and Gemini for diverse annotations
Fine-tunes Qwen-2.5-VL with LoRA adapters
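The LoRA fine-tuning mentioned in the last point can be sketched in isolation. The NumPy snippet below is a minimal illustration of the low-rank update LoRA adds to a frozen linear layer, not the paper's actual training code; the shapes, rank, and scaling factor are assumptions.

```python
import numpy as np

def lora_linear(x, W, A, B, alpha=16, r=8):
    """Frozen base layer plus LoRA low-rank update.

    W: frozen (d_out, d_in) base weight.
    A: (r, d_in) trainable down-projection.
    B: (d_out, r) trainable up-projection, initialized to zero so
       training starts exactly from the base model.
    """
    return W @ x + (alpha / r) * (B @ (A @ x))

rng = np.random.default_rng(0)
d_in, d_out, r = 64, 32, 8
W = rng.normal(size=(d_out, d_in))          # frozen pretrained weight
A = rng.normal(size=(r, d_in))              # trainable, small
B = np.zeros((d_out, r))                    # zero init: update is a no-op
x = rng.normal(size=d_in)

y = lora_linear(x, W, A, B, r=r)
assert np.allclose(y, W @ x)  # before training, output equals the base layer
```

Because only A and B (2 * r * d parameters per layer instead of d * d) receive gradients, adapting a large VLM such as Qwen-2.5-VL becomes feasible on modest hardware.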
Hosam Elgendy
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Ahmed Sharshar
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE
Ahmed Aboeitta
Master's Student
Mohsen Guizani
Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, UAE