Token-Space Mask Prediction for Efficient Vision Transformer Segmentation

📅 2026-05-18
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work proposes TokenMask, a segmentation approach built around a token-space mask head for query-based Vision Transformers. Conventional query-based approaches explicitly reconstruct image-space feature maps before predicting masks, which adds redundant computation and complicates deployment. TokenMask instead computes mask logits directly from query-token affinities and interpolates in logit space rather than feature space. Combined with a ViT backbone and TensorRT FP16 inference, it reduces computational and memory requirements across multiple datasets and segmentation tasks while maintaining competitive accuracy. It also yields speedups on the NVIDIA Jetson AGX Orin, giving a simpler architecture suited to embedded vision applications.
📝 Abstract
Query-based Vision Transformer segmentation models typically reconstruct dense spatial feature maps to predict masks, inheriting design patterns from convolutional architectures. We show that this explicit image-space reconstruction is not required. We introduce TokenMask, a token-space mask head that computes mask logits directly from query-token affinities and performs interpolation in logit space rather than feature space. This reformulation preserves the original linear scoring mechanism while simplifying the computational structure. Across diverse ViT backbones, datasets and segmentation tasks, TokenMask consistently improves efficiency over prior approaches by reducing computational and memory requirements while maintaining competitive accuracy, leading to tangible speedups on NVIDIA Jetson AGX Orin using TensorRT FP16 inference. Overall, TokenMask yields a simpler and more deployment-friendly design for embedded vision systems.
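The claim that interpolating in logit space "preserves the original linear scoring mechanism" can be illustrated with a small NumPy sketch. It is not the paper's implementation: the shapes, the variable names (`tokens`, `queries`), the helper functions, and the align-corners bilinear scheme are all assumptions chosen for illustration. It only shows that when scoring is a plain dot product between queries and features, upsampling the features and then scoring gives the same logits as scoring on the token grid and then upsampling, because both operations are linear.

```python
import numpy as np

def interp_matrix(n_in, n_out):
    """1-D linear interpolation operator of shape (n_out, n_in), align-corners style."""
    W = np.zeros((n_out, n_in))
    pos = np.linspace(0, n_in - 1, n_out)      # sample positions in input coordinates
    lo = np.floor(pos).astype(int)
    hi = np.minimum(lo + 1, n_in - 1)
    frac = pos - lo
    rows = np.arange(n_out)
    W[rows, lo] += 1 - frac
    W[rows, hi] += frac
    return W

def upsample(x, scale):
    """Bilinear upsampling of x with shape (H, W, C) by an integer factor."""
    H, W, _ = x.shape
    Wh = interp_matrix(H, H * scale)
    Ww = interp_matrix(W, W * scale)
    return np.einsum("ah,bw,hwc->abc", Wh, Ww, x)

# Hypothetical sizes: an 8x8 token grid, 16-dim tokens, 4 queries, 4x upsampling.
rng = np.random.default_rng(0)
H, W, D, Q, scale = 8, 8, 16, 4, 4
tokens = rng.standard_normal((H, W, D))    # patch tokens laid out on the ViT grid
queries = rng.standard_normal((Q, D))      # one query token per predicted mask

# Image-space route: upsample the D-channel features, then score against queries.
logits_feature = upsample(tokens, scale) @ queries.T

# Token-space route: score on the token grid, then upsample the Q-channel logits.
logits_token = upsample(tokens @ queries.T, scale)

# Both routes agree up to floating-point error.
print(np.allclose(logits_feature, logits_token))
```

In this sketch the two routes differ only in what gets upsampled: a tensor with `D` channels in the image-space route versus one with `Q` channels in the token-space route. How that translates into actual savings depends on the backbone, the number of queries, and any non-linear layers in a real mask head, none of which are modeled here.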
Problem

Research questions and friction points this paper is trying to address.

Vision Transformer
mask prediction
efficient segmentation
token-space
embedded vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-space mask prediction
Vision Transformer segmentation
Query-token affinity
Logit-space interpolation
Efficient deployment