🤖 AI Summary
Precipitation nowcasting is highly challenging due to atmospheric chaos and strong spatiotemporal coupling. Existing diffusion models face scalability bottlenecks: latent-space approaches rely on auxiliary autoencoders, which compromises generalization; pixel-space methods incur high computational costs and often lack attention mechanisms, which hinders long-range spatiotemporal dependency modeling. To address these limitations, we propose the Token-level Attention Diffusion model (TAD), which natively integrates lightweight tokenized attention into both the U-Net backbone and the spatiotemporal encoder for end-to-end radar echo sequence prediction, eliminating the need for pre-trained autoencoders. This design captures multi-scale spatial interactions and dynamic temporal evolution at low computational overhead. Experiments demonstrate that TAD significantly outperforms state-of-the-art methods across multiple benchmarks, notably improving local detail fidelity, cross-domain generalization, and robustness in complex weather scenarios.
📝 Abstract
Precipitation nowcasting, the prediction of future radar echo sequences from current observations, is a critical yet challenging task due to the inherently chaotic and tightly coupled spatio-temporal dynamics of the atmosphere. Recent diffusion-based models attempt to capture both large-scale motion and fine-grained stochastic variability, but they often suffer from scalability issues: latent-space approaches require a separately trained autoencoder, which adds complexity and limits generalization, while pixel-space approaches are computationally intensive and often omit attention mechanisms, which reduces their ability to model long-range spatio-temporal dependencies. To address these limitations, we propose Token-wise Attention, integrated into both the U-Net diffusion model and the spatio-temporal encoder, which dynamically captures multi-scale spatial interactions and temporal evolution. Unlike prior approaches, our method natively integrates attention into the architecture without incurring the high resource cost typical of pixel-space diffusion, thereby eliminating the need for separate latent modules. Extensive experiments and visual evaluations across diverse datasets demonstrate that the proposed method significantly outperforms state-of-the-art approaches, yielding superior local fidelity, generalization, and robustness in complex precipitation forecasting scenarios.
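The abstract does not spell out how Token-wise Attention is computed, so the following is only a generic illustration of why attending over tokens is cheaper than attending over pixels. It is a minimal NumPy sketch of single-head self-attention over patch tokens, not the paper's actual module. The function names (`patchify`, `unpatchify`, `token_self_attention`), the patch size `p`, the random projection matrices, and the residual connection are all assumptions made for this example.

```python
import numpy as np


def patchify(x, p):
    """Split a (C, H, W) feature map into (N, C*p*p) tokens, N = (H/p)*(W/p)."""
    c, h, w = x.shape
    assert h % p == 0 and w % p == 0, "H and W must be divisible by p"
    x = x.reshape(c, h // p, p, w // p, p)
    x = x.transpose(1, 3, 0, 2, 4)  # (H/p, W/p, C, p, p)
    return x.reshape((h // p) * (w // p), c * p * p)


def unpatchify(tokens, c, h, w, p):
    """Inverse of patchify: (N, C*p*p) tokens back to a (C, H, W) map."""
    x = tokens.reshape(h // p, w // p, c, p, p)
    x = x.transpose(2, 0, 3, 1, 4)  # (C, H/p, p, W/p, p)
    return x.reshape(c, h, w)


def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def token_self_attention(x, p, wq, wk, wv):
    """Single-head self-attention over patch tokens with a residual connection.

    Hypothetical illustration only: wq, wk have shape (d, d_k) and wv has
    shape (d, d), where d = C*p*p is the token dimension.
    """
    c, h, w = x.shape
    tokens = patchify(x, p)                          # (N, d)
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])          # (N, N)
    attn = softmax(scores, axis=-1)
    out = tokens + attn @ v                          # residual update
    return unpatchify(out, c, h, w, p), attn


# Example: a 4-channel 16x16 feature map with 4x4 patches gives 16 tokens.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 16, 16))
d = 4 * 4 * 4
wq = rng.standard_normal((d, 32)) * 0.1
wk = rng.standard_normal((d, 32)) * 0.1
wv = rng.standard_normal((d, d)) * 0.1
y, attn = token_self_attention(x, 4, wq, wk, wv)
print(y.shape, attn.shape)
```

In this toy setting the attention matrix has 16 x 16 entries, whereas attending over all 256 pixel positions would need a 256 x 256 matrix. That quadratic saving is the general motivation for token-level attention in pixel-space models; how the paper's design achieves its efficiency is not stated in the abstract.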