🤖 AI Summary
This work addresses three obstacles to real-time land monitoring with multimodal satellite image time series: high computational complexity, the difficulty of incremental updates, and irregular timestamps. The authors study dual-form attention mechanisms, comparing linear attention and retention within a multimodal spectro-temporal encoder that trains in parallel and runs inference recurrently. They also introduce temporal adaptations that compute token distances from actual acquisition dates rather than sequence indices, which handles irregular sampling and temporal misalignment across Sentinel-1 and Sentinel-2 imagery. The dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multimodal framework outperforms unimodal baselines on both multimodal time-series forecasting and solar panel construction monitoring, which supports its practical use for large-scale land monitoring.
📝 Abstract
Multi-modal Satellite Image Time Series (SITS) analysis faces significant computational challenges in live land monitoring applications. While Transformer architectures excel at capturing temporal dependencies and fusing multi-modal data, their quadratic computational complexity and the need to reprocess the entire sequence for each new acquisition limit their deployment for regular, large-area monitoring. This paper studies dual-form attention mechanisms for efficient multi-modal SITS analysis that enable parallel training while supporting recurrent inference for incremental processing. We compare linear attention and retention mechanisms within a multi-modal spectro-temporal encoder. To address the SITS-specific challenges of temporal irregularity and misalignment, we develop temporal adaptations of dual-form mechanisms that compute token distances from actual acquisition dates rather than sequence indices. Our approach is evaluated on two tasks using Sentinel-1 and Sentinel-2 data: multi-modal SITS forecasting as a proxy task, and real-world solar panel construction monitoring. Experimental results demonstrate that dual-form mechanisms achieve performance comparable to standard Transformers while enabling efficient recurrent inference. The multi-modal framework consistently outperforms mono-modal approaches on both tasks, demonstrating the effectiveness of dual-form mechanisms for sensor fusion. These results open new opportunities for operational land monitoring systems that require regular updates over large geographic areas.
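To make the "dual-form" idea concrete, the following is a minimal sketch of a retention-style mechanism whose decay depends on the gap in acquisition days rather than on the token index. It is an illustration under stated assumptions, not the authors' implementation: the function names, the scalar decay `gamma`, the single-head layout, and the toy data are all hypothetical. The parallel form computes every output from the full sequence, as one would during training. The recurrent form carries a fixed-size state and updates it once per new acquisition, as one would during incremental inference. Both should produce the same outputs.

```python
import math


def parallel_retention(q, k, v, days, gamma=0.98):
    """Parallel form: each output sums over all earlier tokens.

    The weight of token j for query i is gamma ** (days[i] - days[j]),
    so the decay follows real acquisition dates (assumed scalar decay).
    """
    out = []
    for i in range(len(q)):
        acc = [0.0] * len(v[0])
        for j in range(i + 1):  # causal: only past and current tokens
            decay = gamma ** (days[i] - days[j])
            score = decay * sum(a * b for a, b in zip(q[i], k[j]))
            for c in range(len(acc)):
                acc[c] += score * v[j][c]
        out.append(acc)
    return out


def recurrent_retention(q, k, v, days, gamma=0.98):
    """Recurrent form: one fixed-size state update per new acquisition."""
    d_k, d_v = len(k[0]), len(v[0])
    state = [[0.0] * d_v for _ in range(d_k)]  # running sum of decayed k^T v
    prev_day = days[0]
    out = []
    for q_n, k_n, v_n, day in zip(q, k, v, days):
        decay = gamma ** (day - prev_day)  # decay by elapsed days
        for r in range(d_k):
            for c in range(d_v):
                state[r][c] = decay * state[r][c] + k_n[r] * v_n[c]
        out.append([sum(q_n[r] * state[r][c] for r in range(d_k))
                    for c in range(d_v)])
        prev_day = day
    return out


# Toy sequence with irregular acquisition days (hypothetical values).
days = [0, 5, 6, 17, 29]
q = [[1.0, 0.5], [0.2, -0.3], [0.7, 0.1], [-0.4, 0.9], [0.3, 0.3]]
k = [[0.5, 1.0], [0.1, 0.4], [-0.2, 0.6], [0.8, -0.5], [0.3, 0.2]]
v = [[1.0, 2.0, 3.0], [0.5, -1.0, 0.0], [2.0, 0.1, -0.3],
     [0.0, 1.5, 1.0], [-1.0, 0.2, 0.4]]

par = parallel_retention(q, k, v, days)
rec = recurrent_retention(q, k, v, days)
same = all(math.isclose(a, b, rel_tol=1e-9, abs_tol=1e-12)
           for row_p, row_r in zip(par, rec)
           for a, b in zip(row_p, row_r))
print("parallel and recurrent forms agree:", same)
```

The recurrent state has size `d_k × d_v` regardless of how many acquisitions have been seen, which is what lets a new image be processed without revisiting the whole sequence. Fusing two sensors with different acquisition dates would, under the same assumption, amount to merging their tokens into one sequence ordered by day.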