Parallel Sequence Modeling via Generalized Spatial Propagation Network

📅 2025-01-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision Transformers flatten images into 1D sequences, discarding the intrinsic 2D spatial structure and incurring O(N²) computational complexity, which hinders efficient high-resolution image processing. This paper introduces the Generalized Spatial Propagation Network (GSPN), a purely spatial attention mechanism designed explicitly for 2D image data: it eliminates positional encoding and instead forms dense pairwise connectivity through a line-scan procedure, coupled with stability- and context-constrained propagation that learns adaptive, input-dependent weights. The effective sequence length reduces to √N, yielding substantial computational savings. GSPN achieves state-of-the-art performance on ImageNet classification, class-conditional generation, and text-to-image synthesis, and accelerates softmax attention in SD-XL by over 84× when generating 16K-resolution images. Its core contribution is the first realization of stable, position-embedding-free, computationally efficient, and spatially faithful 2D attention modeling.

📝 Abstract
We present the Generalized Spatial Propagation Network (GSPN), a new attention mechanism optimized for vision tasks that inherently captures 2D spatial structures. Existing attention models, including transformers, linear attention, and state-space models like Mamba, process multi-dimensional data as 1D sequences, compromising spatial coherence and efficiency. GSPN overcomes these limitations by directly operating on spatially coherent image data and forming dense pairwise connections through a line-scan approach. Central to GSPN is the Stability-Context Condition, which ensures stable, context-aware propagation across 2D sequences and reduces the effective sequence length to $\sqrt{N}$ for a square map with N elements, significantly enhancing computational efficiency. With learnable, input-dependent weights and no reliance on positional embeddings, GSPN achieves superior spatial fidelity and state-of-the-art performance in vision tasks, including ImageNet classification, class-guided image generation, and text-to-image generation. Notably, GSPN accelerates SD-XL with softmax-attention by over $84\times$ when generating 16K images.
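To make the line-scan idea concrete, here is a minimal NumPy sketch of top-to-bottom 2D propagation, where each row's hidden state mixes its three upstream neighbors from the previous row. The function name and the per-pixel weight normalization are illustrative assumptions standing in for the paper's Stability-Context Condition, not the authors' implementation; the point is that the scan takes only H (= √N for a square map) sequential steps, each fully parallel across the W pixels of a row.

```python
import numpy as np

def line_scan_propagate(x, w_left, w_center, w_right):
    """Illustrative top-to-bottom line scan over a 2D map.

    Each row i is computed from row i-1 via three input-dependent
    connection weights (above-left, above, above-right). Normalizing
    the weights so their absolute values sum to at most 1 per pixel
    is a simple stand-in for a stability constraint on propagation.
    """
    H, W = x.shape
    h = np.zeros_like(x, dtype=float)
    h[0] = x[0]
    for i in range(1, H):  # only H (= sqrt(N)) sequential steps
        # Normalize the three weights per pixel for stable propagation.
        s = np.abs(w_left[i]) + np.abs(w_center[i]) + np.abs(w_right[i]) + 1e-8
        wl, wc, wr = w_left[i] / s, w_center[i] / s, w_right[i] / s
        prev = h[i - 1]
        left = np.roll(prev, 1)    # neighbor above-left
        right = np.roll(prev, -1)  # neighbor above-right
        left[0] = 0.0   # no wrap-around at image borders
        right[-1] = 0.0
        # Each step is fully parallel across the W pixels of the row.
        h[i] = x[i] + wl * left + wc * prev + wr * right
    return h
```

Because every pixel in row i receives information from three pixels in row i-1, repeated scanning yields dense pairwise connectivity across the map while keeping the sequential dependency chain at √N length rather than N.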
Problem

Research questions and friction points this paper is trying to address.

Spatial Information Loss
High-Definition Image Generation
Efficiency Improvement
Innovation

Methods, ideas, or system contributions that make the work stand out.

GSPN
Spatial Information Retention
Stability-Context Condition
Hongjun Wang
NVIDIA, The University of Hong Kong
Wonmin Byeon
NVIDIA Research
Machine Learning · Computer Vision · Artificial Intelligence
Jiarui Xu
University of Sydney
MLOps
Jinwei Gu
NVIDIA
Ka Chun Cheung
NVIDIA
Xiaolong Wang
NVIDIA, University of California, San Diego
Kai Han
The University of Hong Kong
Jan Kautz
Vice President of Research, NVIDIA Research
Computer Vision · Machine Learning · Visual Computing
Sifei Liu
NVIDIA
Computer Vision · Machine Learning