🤖 AI Summary
Existing vision models struggle to explicitly model the spatial propagation mechanism of semantic information, making it difficult to simultaneously capture global structure and high-frequency details. This work proposes the first integration of the underdamped wave equation into visual modeling, introducing a Wave Propagation Operator (WPO) that explicitly decouples spatial frequencies from propagation dynamics through an internal propagation time (corresponding to network depth), and derives a closed-form solution. The resulting WaveFormer architecture achieves O(N log N) complexity and serves as a drop-in replacement for ViT or CNN backbones. Experiments demonstrate that the method matches the accuracy of attention-based models across image classification, object detection, and semantic segmentation, while improving throughput by up to 1.6x and reducing FLOPs by 30%.
📝 Abstract
Vision modeling has advanced rapidly with Transformers, whose attention mechanisms capture visual dependencies but lack a principled account of how semantic information propagates spatially. We revisit this problem from a wave-based perspective: feature maps are treated as spatial signals whose evolution over an internal propagation time (aligned with network depth) is governed by an underdamped wave equation. In this formulation, spatial frequency, from low-frequency global layout to high-frequency edges and textures, is modeled explicitly, and its interaction with propagation time is controlled rather than implicitly fixed. We derive a closed-form, frequency-time decoupled solution and implement it as the Wave Propagation Operator (WPO), a lightweight module that models global interactions in O(N log N) time, far below the O(N^2) cost of attention. Building on WPO, we propose a family of WaveFormer models as drop-in replacements for standard ViTs and CNNs, achieving competitive accuracy across image classification, object detection, and semantic segmentation, while delivering up to 1.6x higher throughput and 30% fewer FLOPs than attention-based alternatives. Furthermore, our results demonstrate that wave propagation introduces a complementary modeling bias to heat-based methods, effectively capturing both global coherence and high-frequency details essential for rich visual semantics. Code is available at: https://github.com/ZishanShu/WaveFormer.
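To make the idea concrete, the following is a minimal, hypothetical sketch of a frequency-domain wave propagation step in the spirit of the abstract, not the paper's actual implementation. It assumes a single-channel feature map, zero initial velocity, and illustrative damping and speed parameters (`gamma`, `c`). Each spatial-frequency mode evolves independently under the closed-form underdamped solution, and the FFT gives the stated O(N log N) cost:

```python
import numpy as np

def wave_propagation_operator(x, t=1.0, c=1.0, gamma=0.1):
    """Sketch of an FFT-based underdamped wave propagation step.

    Evolves a 2D feature map x under u_tt + 2*gamma*u_t = c^2 * laplacian(u)
    for an internal propagation time t, assuming zero initial velocity.
    All parameter names and values here are illustrative assumptions.
    """
    H, W = x.shape
    ky = np.fft.fftfreq(H) * 2 * np.pi
    kx = np.fft.fftfreq(W) * 2 * np.pi
    k2 = ky[:, None] ** 2 + kx[None, :] ** 2  # squared spatial frequency |k|^2

    x_hat = np.fft.fft2(x)  # O(N log N) in the number of pixels N

    # Per-mode damped angular frequency; clipped near |k| = 0, where the
    # underdamped form would otherwise break down (c^2|k|^2 < gamma^2).
    omega = np.sqrt(np.maximum(c ** 2 * k2 - gamma ** 2, 1e-12))

    # Closed-form underdamped response with u(0) = x, u_t(0) = 0:
    #   u_hat(t) = x_hat * exp(-gamma*t) * (cos(w*t) + (gamma/w) * sin(w*t))
    # The decay envelope and the oscillation are decoupled per frequency.
    kernel = np.exp(-gamma * t) * (
        np.cos(omega * t) + (gamma / omega) * np.sin(omega * t)
    )

    return np.real(np.fft.ifft2(x_hat * kernel))
```

Note the frequency-time decoupling the abstract describes: low-|k| modes (global layout) decay slowly and oscillate slowly, while high-|k| modes (edges, textures) oscillate quickly but are not simply smoothed away as they would be under a heat equation, whose kernel decays monotonically in |k|.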