🤖 AI Summary
This work addresses the inefficiency of modeling long-range couplings among spatial points in partial differential equation (PDE) solvers by proposing a physics-informed, unified low-rank Transformer framework. Leveraging the low-rank structure of global interaction kernels, the method compresses high-dimensional spatial features into a latent space for efficient global interaction and subsequently reconstructs them back to the original space. For the first time, it unifies prevailing global mixing modules under standard Transformer primitives (attention, normalization, and feed-forward networks), and it pairs low-rank spatial attention with mixed-precision training to yield a concise, hardware-friendly, and stable architecture. Evaluated across multiple PDE tasks, the approach reduces average error by over 17% relative to the next-best method while maintaining superior computational efficiency.
📝 Abstract
Neural operators have emerged as data-driven surrogates for solving partial differential equations (PDEs), and their success hinges on efficiently modeling the long-range, global coupling among spatial points induced by the underlying physics. In many PDE regimes, the induced global interaction kernels are empirically compressible, exhibiting rapid spectral decay that admits low-rank approximations. We leverage this observation to unify representative global mixing modules in neural operators under a shared low-rank template: compressing high-dimensional pointwise features into a compact latent space, processing global interactions within it, and reconstructing the global context back to spatial points. Guided by this view, we introduce Low-Rank Spatial Attention (LRSA) as a clean and direct instantiation of this template. Crucially, unlike prior approaches that often rely on non-standard aggregation or normalization modules, LRSA is built purely from standard Transformer primitives, i.e., attention, normalization, and feed-forward networks, yielding a concise block that is straightforward to implement and directly compatible with hardware-optimized kernels. In our experiments, such a simple construction is sufficient to achieve high accuracy, yielding an average error reduction of over 17% relative to second-best methods, while remaining stable and efficient in mixed-precision training.
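The compress-mix-reconstruct template described above can be sketched with only standard Transformer primitives. The code below is an illustrative approximation, not the authors' actual LRSA implementation: the module names, the number of latent tokens `num_latents`, and the specific residual placement are assumptions for demonstration. The key point it shows is the cost structure: with `M` learned latent tokens and `N` spatial points, both cross-attention steps scale as O(N·M) rather than the O(N²) of full spatial self-attention.

```python
import torch
import torch.nn as nn

class LowRankSpatialAttentionSketch(nn.Module):
    """Illustrative sketch of a low-rank global-mixing block (hypothetical
    names/hyperparameters, not the paper's exact LRSA): compress N spatial
    points into M << N latent tokens via cross-attention, mix globally in
    the latent space, then reconstruct back to the N points. Built only
    from standard primitives: MultiheadAttention, LayerNorm, and an FFN."""

    def __init__(self, dim: int, num_latents: int = 64, num_heads: int = 4):
        super().__init__()
        # Learned latent queries spanning the compact latent space.
        self.latents = nn.Parameter(torch.randn(num_latents, dim) * 0.02)
        self.norm_in = nn.LayerNorm(dim)
        self.encode = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_latent = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )
        self.norm_out = nn.LayerNorm(dim)
        self.decode = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, N, dim) pointwise features on N spatial points.
        b = x.shape[0]
        q = self.latents.unsqueeze(0).expand(b, -1, -1)  # (b, M, dim)
        xn = self.norm_in(x)
        # Compress: M latents attend to all N points -> O(N*M).
        z, _ = self.encode(q, xn, xn)
        # Global mixing inside the latent space.
        z = z + self.ffn(self.norm_latent(z))
        # Reconstruct: N points attend back to M latents -> O(N*M).
        out, _ = self.decode(self.norm_out(x), z, z)
        return x + out  # residual back to the original spatial resolution


if __name__ == "__main__":
    block = LowRankSpatialAttentionSketch(dim=32, num_latents=8, num_heads=4)
    x = torch.randn(2, 100, 32)  # batch of 2, 100 spatial points
    y = block(x)
    print(y.shape)  # spatial resolution and channels are preserved
```

Because every component here (attention, LayerNorm, FFN) is a stock Transformer primitive, the block maps directly onto hardware-optimized kernels and mixed-precision training, which is the design property the abstract emphasizes.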