Polyline Path Masked Attention for Vision Transformer

📅 2025-06-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of jointly optimizing global dependency modeling and spatial position modeling in Vision Transformers (ViTs). We propose Polyline Path Masked Attention (PPMA), a novel attention mechanism that integrates the structured masking principle of Mamba2 into ViT's self-attention framework. PPMA introduces a 2D polyline path scanning strategy to construct explicit spatial adjacency priors as structured masks, accompanied by a theoretical analysis and an efficient computation algorithm. On ADE20K semantic segmentation, PPMA-B achieves 52.3% mIoU, surpassing RMT-B by 0.3%, and delivers consistent gains across image classification, object detection, and semantic segmentation. The implementation is publicly available.

📝 Abstract
Global dependency modeling and spatial position modeling are two core issues in the design of foundational deep learning architectures. Recently, Vision Transformers (ViTs) have achieved remarkable success in computer vision, leveraging the powerful global dependency modeling capability of the self-attention mechanism. Furthermore, Mamba2 has demonstrated significant potential in natural language processing tasks by explicitly modeling the adjacency prior through its structured mask. In this paper, we propose Polyline Path Masked Attention (PPMA), which integrates the self-attention mechanism of ViTs with an enhanced structured mask of Mamba2, harnessing the complementary strengths of both architectures. Specifically, we first ameliorate the traditional structured mask of Mamba2 by introducing a 2D polyline path scanning strategy and derive its corresponding structured mask, the polyline path mask, which better preserves the adjacency relationships among image tokens. Notably, we conduct a thorough theoretical analysis of the structural characteristics of the proposed polyline path mask and design an efficient algorithm for its computation. Next, we embed the polyline path mask into the self-attention mechanism of ViTs, enabling explicit modeling of the spatial adjacency prior. Extensive experiments on standard benchmarks, including image classification, object detection, and segmentation, demonstrate that our model outperforms previous state-of-the-art approaches based on both state-space models and Transformers. For example, our proposed PPMA-T/S/B models achieve 48.7%/51.1%/52.3% mIoU on the ADE20K semantic segmentation task, surpassing RMT-T/S/B by 0.7%/1.3%/0.3%, respectively. Code is available at https://github.com/zhongchenzhao/PPMA.
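The core idea of embedding a spatial adjacency prior into self-attention can be illustrated with a small sketch. The snippet below is not the paper's implementation: it approximates the polyline path mask with a simple Manhattan-distance decay mask (a related design used by spatial-decay attention such as RMT), where `gamma` and the mask construction are illustrative assumptions; the mask is then applied multiplicatively to the attention weights.

```python
import numpy as np

def manhattan_decay_mask(h, w, gamma=0.9):
    # Illustrative stand-in for the polyline path mask: decay the
    # attention weight between two tokens by gamma ** (Manhattan
    # distance between their 2D grid positions).
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    coords = np.stack([ys.ravel(), xs.ravel()], axis=1)  # (h*w, 2)
    dist = np.abs(coords[:, None, :] - coords[None, :, :]).sum(-1)
    return gamma ** dist  # (h*w, h*w), 1.0 on the diagonal

def masked_attention(q, k, v, mask):
    # Standard scaled dot-product attention, with the structured mask
    # injected elementwise into the (unnormalized) attention weights.
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = np.exp(scores - scores.max(-1, keepdims=True))
    weights = weights * mask               # spatial adjacency prior
    weights = weights / weights.sum(-1, keepdims=True)
    return weights @ v
```

Because the mask multiplies the attention weights before row normalization, nearby tokens are upweighted while long-range interactions are attenuated rather than cut off, keeping the global receptive field of self-attention intact.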
Problem

Research questions and friction points this paper is trying to address.

Enhancing global dependency modeling in Vision Transformers
Improving spatial adjacency modeling with polyline path masks
Combining the strengths of ViT and Mamba2 architectures
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates ViT self-attention with Mamba2's structured mask
Uses a 2D polyline path scanning strategy to derive the mask
Enables explicit modeling of the spatial adjacency prior
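To make the second point concrete, one plausible instance of a 2D polyline path is a serpentine (zigzag) scan, which, unlike a plain raster scan, keeps every pair of consecutive tokens spatially adjacent; this adjacency-preserving property is exactly what the polyline path mask is designed to retain. This is a hedged sketch of the idea, not the paper's exact scanning strategy, and `polyline_scan_order` is a hypothetical helper name.

```python
def polyline_scan_order(h, w):
    # Serpentine scan of an h x w token grid: even rows left-to-right,
    # odd rows right-to-left, so successive tokens in the 1D order are
    # always neighbors in the 2D grid (Manhattan distance 1).
    order = []
    for row in range(h):
        cols = range(w) if row % 2 == 0 else range(w - 1, -1, -1)
        order.extend(row * w + c for c in cols)
    return order
```

For a 2x3 grid this yields `[0, 1, 2, 5, 4, 3]`: the scan turns around at the end of each row instead of jumping back to column 0, avoiding the large positional discontinuity of a raster scan.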