Your ViT is Secretly an Image Segmentation Model

📅 2025-03-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of achieving efficient image segmentation with a pure Vision Transformer (ViT) encoder, without task-specific modules such as convolutional adapters, pixel decoders, or Transformer decoders. We propose the Encoder-only Mask Transformer (EoMT), which leverages large-scale ViTs (e.g., ViT-L) with joint self-supervised and supervised pretraining, adding only a lightweight mask prediction head and a simple upsampling decoder. Our key finding is the first empirical demonstration that a sufficiently pretrained ViT encoder inherently acquires the multi-scale inductive biases essential for segmentation. Evaluated on benchmarks including ADE20K, EoMT achieves state-of-the-art accuracy while accelerating inference by up to 4× over prior methods, establishing a new Pareto-optimal trade-off between segmentation accuracy and computational efficiency.
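To make the encoder-only idea concrete, here is a minimal NumPy sketch of the mask-prediction step the summary describes: learnable query tokens are appended to the patch-token sequence so the ViT's own self-attention plays the role of a Transformer decoder, and a lightweight head then dots each query embedding against the patch tokens to produce per-query masks. All dimensions and weights below are toy values chosen for illustration, not the paper's configuration, and the ViT blocks themselves are elided.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not the paper's): 16x16 patch grid,
# embedding dim 64, 8 segmentation queries, 10 classes.
num_patches, dim, num_queries, num_classes = 256, 64, 8, 10

# A plain ViT encoder yields patch tokens; EoMT additionally appends
# learnable query tokens to the sequence, so the encoder's self-attention
# does the work of a separate Transformer decoder. Here both are random
# stand-ins for the outputs of the final encoder blocks.
patch_tokens = rng.standard_normal((num_patches, dim))
query_tokens = rng.standard_normal((num_queries, dim))

# Lightweight mask head (hypothetical): each query embedding is dotted
# with every patch token, giving one mask logit per patch per query.
mask_logits = query_tokens @ patch_tokens.T            # (queries, patches)
masks = mask_logits.reshape(num_queries, 16, 16)       # per-query patch-grid masks

# Per-query class prediction via a linear layer (random weights here).
W_cls = rng.standard_normal((dim, num_classes))
class_logits = query_tokens @ W_cls                    # (queries, classes)

print(masks.shape, class_logits.shape)                 # (8, 16, 16) (10 classes)
```

In the actual model these patch-grid masks would be upsampled to full image resolution by the simple upsampling decoder; the point of the sketch is that no convolutional adapter or pixel decoder sits between the encoder and the prediction head.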

๐Ÿ“ Abstract
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.
Problem

Research questions and friction points this paper is trying to address.

Repurpose plain ViT for image segmentation without task-specific components
Achieve high segmentation accuracy with large-scale models and pre-training
Optimize balance between segmentation accuracy and prediction speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposes plain ViT for image segmentation
Uses large-scale models and pre-training
Achieves speed-accuracy balance without extra components