🤖 AI Summary
High-resolution multivariate climate data exhibit strong local dynamics, teleconnections, multiscale interactions, and non-stationarity, posing significant challenges for conventional ConvLSTM2D models in efficiently capturing long-range spatial dependencies and disentangling climate dynamics. To address these challenges, this work proposes FAConvLSTM—a plug-and-play replacement for ConvLSTM2D—that enhances both efficiency and interpretability through a factorized attention mechanism. Specifically, it incorporates lightweight axial spatial attention and a temporal sparsity strategy to capture teleconnections, while integrating multiscale dilated depthwise convolutions with season-aware positional encoding in a temporal self-attention module for effective multiscale modeling. Experiments demonstrate that FAConvLSTM substantially reduces computational overhead while yielding more stable, robust, and physically interpretable latent representations.
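The efficiency claim behind axial spatial attention can be illustrated with a rough cost count. This is a sketch, not the paper's implementation; the grid size below is hypothetical:

```python
# Rough attention-cost comparison for one H x W feature map.
# Global self-attention scores every pixel against every pixel: (H*W)^2 pairs.
# Axial attention scores along rows and columns only: H*W*(H + W) pairs.

def global_attention_pairs(H, W):
    """Query-key pairs for full spatial self-attention."""
    n = H * W
    return n * n

def axial_attention_pairs(H, W):
    """Query-key pairs for a row-wise pass plus a column-wise pass."""
    return H * W * W + H * W * H

H, W = 64, 64  # hypothetical grid; high-resolution climate fields are often larger
print(global_attention_pairs(H, W))  # 16777216
print(axial_attention_pairs(H, W))   # 524288, i.e. 32x fewer pairs
```

The gap widens with resolution, which is why axial factorization (applied only at sparse timesteps) keeps teleconnection modeling affordable.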
📝 Abstract
Learning physically meaningful spatiotemporal representations from high-resolution multivariate Earth observation data is challenging due to strong local dynamics, long-range teleconnections, multi-scale interactions, and nonstationarity. While ConvLSTM2D is a commonly used baseline, its dense convolutional gating incurs high computational cost and its strictly local receptive fields limit the modeling of long-range spatial structure and disentangled climate dynamics. To address these limitations, we propose FAConvLSTM, a Factorized-Attention ConvLSTM layer designed as a drop-in replacement for ConvLSTM2D that simultaneously improves efficiency, spatial expressiveness, and physical interpretability. FAConvLSTM factorizes recurrent gate computations using lightweight 1×1 bottlenecks and shared depthwise spatial mixing, substantially reducing channel complexity while preserving recurrent dynamics. Multi-scale dilated depthwise branches and squeeze-and-excitation recalibration enable efficient modeling of interacting physical processes across spatial scales, while peephole connections enhance temporal precision. To capture teleconnection-scale dependencies without incurring global attention cost, FAConvLSTM incorporates a lightweight axial spatial attention mechanism applied sparsely in time. A dedicated subspace head further produces compact per-timestep embeddings refined through temporal self-attention with fixed seasonal positional encoding. Experiments on multivariate spatiotemporal climate data demonstrate that FAConvLSTM yields more stable, interpretable, and robust latent representations than standard ConvLSTM, while significantly reducing computational overhead.
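The parameter savings from factorized gating can be sketched with a back-of-the-envelope count. The channel sizes, bottleneck width, and exact squeeze/depthwise/expand layout below are hypothetical (the abstract does not specify them); the sketch only illustrates the scaling argument:

```python
def convlstm2d_gate_params(c_in, c_h, k):
    """Dense ConvLSTM2D gating: four gates, each a k x k conv
    over the concatenated input + hidden channels."""
    return 4 * k * k * (c_in + c_h) * c_h

def factorized_gate_params(c_in, c_h, c_b, k):
    """Hypothetical factorization: 1x1 bottleneck to c_b channels,
    a shared k x k depthwise spatial mixer, then a 1x1 expansion
    that produces all four gates at once."""
    squeeze = (c_in + c_h) * c_b   # 1x1 bottleneck
    depthwise = k * k * c_b        # shared depthwise spatial mixing
    expand = c_b * 4 * c_h         # 1x1 projection to the four gates
    return squeeze + depthwise + expand

# Hypothetical sizes: 16 input variables, 64 hidden channels, 3x3 kernels.
dense = convlstm2d_gate_params(16, 64, 3)        # 184320 weights
fact = factorized_gate_params(16, 64, 32, 3)     # 11040 weights
print(dense, fact)
```

The dense cost grows with `k*k*(c_in + c_h)*c_h` per gate, whereas the factorized path pays the k×k spatial cost only once, depthwise, in the bottleneck: this is the "substantially reducing channel complexity" claim in concrete terms.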