🤖 AI Summary
This study investigates whether Vision Transformers (ViTs) are indispensable as visual encoders in vision-language models (VLMs), presenting the first systematic evaluation of state space models (SSMs) as alternative backbones. Under consistent ImageNet-1K initialization conditions and using a lightweight connector to interface with a large language model, the authors compare SSMs and ViTs on visual question answering and localization tasks, while also examining the impact of dense-task fine-tuning and training stability. Results demonstrate that SSMs achieve performance comparable or superior to ViTs at smaller model scales; dense fine-tuning consistently enhances performance across architectures; and the proposed stabilization strategies significantly improve robustness for both backbone types. The work further reveals a weak correlation between ImageNet accuracy and downstream VLM performance, offering new insights for visual encoder design.
📝 Abstract
Large vision-language models (VLMs) often use a frozen vision backbone, whose image features are mapped into a large language model through a lightweight connector. While transformer-based encoders are the standard visual backbone, we ask whether state space model (SSM) vision backbones can be a strong alternative. We systematically evaluate SSM vision backbones for VLMs in a controlled setting. Under matched ImageNet-1K initialization, the SSM backbone achieves the strongest overall performance across both VQA and grounding/localization. We further adapt both SSM and ViT-family backbones with detection or segmentation training and find that dense-task tuning generally improves performance across families; after this adaptation, the SSM backbone remains competitive while operating at a substantially smaller model scale. We also observe that (i) higher ImageNet accuracy or larger backbones do not reliably translate into better VLM performance, and (ii) some visual backbones are unstable in localization. Based on these findings, we propose stabilization strategies that improve robustness for both backbone families and highlight SSM backbones as a strong alternative to transformer-based vision encoders in VLMs.
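The frozen-backbone-plus-connector design described above can be sketched minimally: per-patch features from a frozen vision encoder are projected by a small trained module into the language model's embedding space. All names, dimensions, and the two-layer MLP form below are illustrative assumptions, not the paper's actual connector.

```python
import numpy as np

# Hypothetical dimensions: 768-d vision features per patch, a 4096-d LLM
# embedding space, and 196 patches (e.g. a 14x14 grid). These are assumptions.
VISION_DIM, LLM_DIM, NUM_PATCHES = 768, 4096, 196

rng = np.random.default_rng(0)

# Stand-in for the frozen backbone's output: one feature vector per patch.
patch_features = rng.standard_normal((NUM_PATCHES, VISION_DIM))

# Lightweight two-layer MLP connector -- in this setup, the only trained
# vision-side component; the backbone itself stays frozen.
W1 = rng.standard_normal((VISION_DIM, LLM_DIM)) * 0.02
b1 = np.zeros(LLM_DIM)
W2 = rng.standard_normal((LLM_DIM, LLM_DIM)) * 0.02
b2 = np.zeros(LLM_DIM)

def connector(x):
    # ReLU nonlinearity used here for simplicity; real connectors often use GELU.
    h = np.maximum(x @ W1 + b1, 0.0)
    return h @ W2 + b2

# Visual tokens ready to be interleaved with the LLM's text-token sequence.
visual_tokens = connector(patch_features)
print(visual_tokens.shape)  # (196, 4096)
```

The connector's appeal is that swapping the backbone (ViT vs. SSM) only changes `VISION_DIM` and the feature source; the language model and training recipe stay fixed, which is what makes the controlled comparison possible.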