Your ViT is Secretly an Image Segmentation Model

📅 2025-03-24
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of achieving efficient image segmentation with a pure Vision Transformer (ViT) encoder, without task-specific modules such as convolutional adapters, pixel decoders, or Transformer decoders. We propose the Encoder-only Mask Transformer (EoMT), which leverages large-scale ViTs (e.g., ViT-L) with joint self-supervised and supervised pretraining, adding only a lightweight mask prediction head and a simple upsampling decoder. Our key finding is the first empirical demonstration that a sufficiently pretrained ViT encoder inherently acquires the multi-scale inductive biases essential for segmentation. Evaluated on benchmarks including ADE20K, EoMT achieves state-of-the-art accuracy while accelerating inference by up to 4× over prior methods, establishing a new Pareto-optimal trade-off between segmentation accuracy and computational efficiency.
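To make the encoder-only idea concrete, here is a minimal NumPy sketch of the mask-prediction step the summary describes: learnable query tokens are appended to the patch-token sequence so the ViT's own self-attention plays the role of a Transformer decoder, and a lightweight head then dots each query embedding against the patch tokens to produce per-query masks. All dimensions and weights below are toy values chosen for illustration, not the paper's configuration, and the ViT blocks themselves are elided.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (illustrative, not the paper's): 16x16 patch grid,
# embedding dim 64, 8 segmentation queries, 10 classes.
num_patches, dim, num_queries, num_classes = 256, 64, 8, 10

# A plain ViT encoder yields patch tokens; EoMT additionally appends
# learnable query tokens to the sequence, so the encoder's self-attention
# does the work of a separate Transformer decoder. Here both are random
# stand-ins for the outputs of the final encoder blocks.
patch_tokens = rng.standard_normal((num_patches, dim))
query_tokens = rng.standard_normal((num_queries, dim))

# Lightweight mask head (hypothetical): each query embedding is dotted
# with every patch token, giving one mask logit per patch per query.
mask_logits = query_tokens @ patch_tokens.T            # (queries, patches)
masks = mask_logits.reshape(num_queries, 16, 16)       # per-query patch-grid masks

# Per-query class prediction via a linear layer (random weights here).
W_cls = rng.standard_normal((dim, num_classes))
class_logits = query_tokens @ W_cls                    # (queries, classes)

print(masks.shape, class_logits.shape)                 # (8, 16, 16) (10 classes)
```

In the actual model these patch-grid masks would be upsampled to full image resolution by the simple upsampling decoder; the point of the sketch is that no convolutional adapter or pixel decoder sits between the encoder and the prediction head.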

๐Ÿ“ Abstract
Vision Transformers (ViTs) have shown remarkable performance and scalability across various computer vision tasks. To apply single-scale ViTs to image segmentation, existing methods adopt a convolutional adapter to generate multi-scale features, a pixel decoder to fuse these features, and a Transformer decoder that uses the fused features to make predictions. In this paper, we show that the inductive biases introduced by these task-specific components can instead be learned by the ViT itself, given sufficiently large models and extensive pre-training. Based on these findings, we introduce the Encoder-only Mask Transformer (EoMT), which repurposes the plain ViT architecture to conduct image segmentation. With large-scale models and pre-training, EoMT obtains a segmentation accuracy similar to state-of-the-art models that use task-specific components. At the same time, EoMT is significantly faster than these methods due to its architectural simplicity, e.g., up to 4x faster with ViT-L. Across a range of model sizes, EoMT demonstrates an optimal balance between segmentation accuracy and prediction speed, suggesting that compute resources are better spent on scaling the ViT itself rather than adding architectural complexity. Code: https://www.tue-mps.org/eomt/.
Problem

Research questions and friction points this paper is trying to address.

Repurpose plain ViT for image segmentation without task-specific components
Achieve high segmentation accuracy with large-scale models and pre-training
Optimize balance between segmentation accuracy and prediction speed
Innovation

Methods, ideas, or system contributions that make the work stand out.

Repurposes plain ViT for image segmentation
Uses large-scale models and pre-training
Achieves speed-accuracy balance without extra components