🤖 AI Summary
Current vision-language models (VLMs) for environmental monitoring neglect sensor-based causal signals, suffer from stylistic biases inherent in single-source captioning, and lack interactive spatiotemporal reasoning capabilities. To address these limitations, we propose the first VLM framework enabling interactive spatiotemporal reasoning by jointly modeling multi-temporal satellite imagery with environmental sensor data, including temperature, PM₁₀, and CO measurements. We construct a large-scale dataset of temporal image pairs with sensor metadata and employ GPT-4o and Gemini 2.0 to generate diverse, style-robust annotations, mitigating caption-style bias. Our model is built upon the Qwen-2.5-VL architecture and fine-tuned efficiently via LoRA. Experiments show strong performance on temporal reasoning and counterfactual (“what-if”) analysis tasks (e.g., a BERT-F1 score of 0.903), matching or surpassing state-of-the-art temporal models while supporting interactive, scenario-based analysis of environmental change.
📝 Abstract
Understanding environmental changes from aerial imagery is vital for climate resilience, urban planning, and ecosystem monitoring. Yet, current vision-language models (VLMs) overlook causal signals from environmental sensors, rely on single-source captions prone to stylistic bias, and lack interactive scenario-based reasoning. We present ChatENV, the first interactive VLM that jointly reasons over satellite image pairs and real-world sensor data. Our framework: (i) creates a 177k-image dataset forming 152k temporal pairs across 62 land-use classes in 197 countries with rich sensor metadata (e.g., temperature, PM10, CO); (ii) annotates data using GPT-4o and Gemini 2.0 for stylistic and semantic diversity; and (iii) fine-tunes Qwen-2.5-VL using efficient Low-Rank Adaptation (LoRA) adapters for chat purposes. ChatENV achieves strong performance in temporal and "what-if" reasoning (e.g., BERT-F1 0.903) and rivals or outperforms state-of-the-art temporal models, while supporting interactive scenario-based analysis. This positions ChatENV as a powerful tool for grounded, sensor-aware environmental monitoring.