🤖 AI Summary
Existing vision models struggle to explicitly model the spatial propagation mechanism of semantic information, making it difficult to simultaneously capture global structure and high-frequency details. This work proposes the first integration of the underdamped wave equation into visual modeling, introducing a Wave Propagation Operator (WPO) that explicitly decouples spatial frequencies from propagation dynamics through an internal propagation time (corresponding to network depth), and derives a closed-form solution. The resulting WaveFormer architecture achieves O(N log N) complexity and serves as a drop-in replacement for ViT or CNN backbones. Experiments demonstrate that the method matches the accuracy of attention-based models across image classification, object detection, and semantic segmentation, while improving throughput by up to 1.6x and reducing FLOPs by 30%.
📝 Abstract
Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave-based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency, from low-frequency global layout to high-frequency edges and textures, is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed-form, frequency-time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time, far below the O(N^2) cost of attention. Building on WPO, we propose a family of WaveFormer models as drop-in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6x higher throughput and 30% fewer FLOPs than attention-based alternatives. Furthermore, our results demonstrate that wave propagation introduces a complementary modeling bias to heat-based methods, effectively capturing both global coherence and high-frequency details essential for rich visual semantics. Code is available at: https://github.com/ZishanShu/WaveFormer.
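To make the idea concrete, the following is a minimal, hypothetical sketch of a frequency-domain wave propagation step in the spirit of the abstract, not the paper's actual implementation. It assumes a single-channel feature map, zero initial velocity, and illustrative damping and speed parameters (`gamma`, `c`). Each spatial-frequency mode evolves independently under the closed-form underdamped solution, and the FFT gives the stated O(N log N) cost:

```python
import numpy as np

def wave_propagation_operator(x, t=1.0, c=1.0, gamma=0.1):
    """Sketch of an FFT-based underdamped wave propagation step.

    Evolves a 2D feature map x under u_tt + 2*gamma*u_t = c^2 * laplacian(u)
    for an internal propagation time t, assuming zero initial velocity.
    All parameter names and values here are illustrative assumptions.
    """
    H, W = x.shape
    ky = np.fft.fftfreq(H) * 2 * np.pi
    kx = np.fft.fftfreq(W) * 2 * np.pi
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2  # squared spatial frequency |k|^2

    x_hat = np.fft.fft2(x)  # O(N log N) in the number of pixels N

    # Per-mode damped angular frequency; clipped near |k| = 0, where the
    # underdamped form would otherwise break down (c^2|k|^2 < gamma^2).
    omega = np.sqrt(np.maximum(c ** 2 * k2 - gamma ** 2, 1e-12))

    # Closed-form underdamped response with u(0) = x, u_t(0) = 0:
    #   u_hat(t) = x_hat * exp(-gamma*t) * (cos(w*t) + (gamma/w) * sin(w*t))
    # The decay envelope and the oscillation are decoupled per frequency.
    kernel = np.exp(-gamma * t) * (
        np.cos(omega * t) + (gamma / omega) * np.sin(omega * t)
    )

    return np.real(np.fft.ifft2(x_hat * kernel))
```

Note the frequency-time decoupling the abstract describes: low-|k| modes (global layout) decay slowly and oscillate slowly, while high-|k| modes (edges, textures) oscillate quickly but are not simply smoothed away as they would be under a heat equation, whose kernel decays monotonically in |k|.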