🤖 AI Summary
This work addresses the efficiency and performance limitations of Vision Transformers in generative modeling, which stem from the quadratic complexity of self-attention and weak spatial inductive bias. To overcome these challenges, the authors propose PDE-SSM, a novel architecture that replaces conventional attention mechanisms with a learnable multidimensional convection-diffusion-reaction partial differential equation (PDE) as an efficient spatial mixing operator endowed with strong spatial priors. The PDE is solved in the Fourier domain via spectral methods, enabling near-linear complexity for global modeling. Integrated within a state space model and combined with a diffusion generative framework, the resulting PDE-SSM-DiT achieves performance on par with or superior to state-of-the-art diffusion Transformers in image generation tasks while substantially reducing computational overhead.
📝 Abstract
The success of vision transformers, especially for generative modeling, is limited by the quadratic cost and weak spatial inductive bias of self-attention. We propose PDE-SSM, a spatial state-space block that replaces attention with a learnable convection-diffusion-reaction partial differential equation (PDE). This operator encodes a strong spatial prior by modeling information flow via physically grounded dynamics rather than all-to-all token interactions. Solving the PDE in the Fourier domain yields global coupling with near-linear $O(N \log N)$ complexity, delivering a principled and scalable alternative to attention. We integrate PDE-SSM into a flow-matching generative model to obtain PDE-SSM-DiT, a PDE-based Diffusion Transformer. Empirically, PDE-SSM-DiT matches or exceeds the performance of state-of-the-art Diffusion Transformers while substantially reducing compute. Our results show that, just as SSMs supplant attention in 1D settings, multi-dimensional PDE operators provide an efficient, inductive-bias-rich foundation for next-generation vision models.
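To make the Fourier-domain mechanism concrete, the sketch below (not the authors' code) integrates one step of a 2D convection-diffusion-reaction PDE, $u_t = -c \cdot \nabla u + D \nabla^2 u + r u$, exactly in the spectral domain: each Fourier mode is multiplied by the exponential of the operator's symbol, so a single FFT pair gives global spatial coupling at near-linear cost. The scalar parameters `c`, `D`, and `r` are hypothetical stand-ins for the learnable coefficients described in the paper.

```python
import numpy as np

def pde_spectral_step(u, dt, c=(0.5, 0.0), D=0.1, r=-0.05):
    """One exact integration step of u_t = -c.grad(u) + D lap(u) + r u
    on a periodic square grid, via the Fourier transform.

    Illustrative sketch only: c (convection velocity), D (diffusivity),
    and r (reaction rate) model the PDE's learnable parameters."""
    N = u.shape[0]                        # assume a square N x N feature map
    k = 2 * np.pi * np.fft.fftfreq(N)     # Fourier wavenumbers
    kx, ky = np.meshgrid(k, k, indexing="ij")
    # Symbol of the spatial operator: -i c.k - D|k|^2 + r
    symbol = -1j * (c[0] * kx + c[1] * ky) - D * (kx**2 + ky**2) + r
    u_hat = np.fft.fft2(u)                # O(N^2 log N) for the N x N grid
    u_hat *= np.exp(dt * symbol)          # exact update of each mode
    return np.real(np.fft.ifft2(u_hat))

# Usage: globally mix a random feature map in a single step.
u0 = np.random.randn(64, 64)
u1 = pde_spectral_step(u0, dt=1.0)
```

Because the update is exact per mode, the step is unconditionally stable for any `dt`; with positive diffusivity and negative reaction rate, high-frequency content is damped most strongly, which is the smoothing spatial prior the abstract refers to.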