Separators in Enhancing Autoregressive Pretraining for Vision Mamba

πŸ“… 2026-03-04
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This work addresses a limitation of existing autoregressive pretraining methods, which are constrained to short sequences and thus fail to fully exploit Vision Mamba's capacity for long-range modeling. To overcome this, the authors propose STAR, a novel approach that introduces a unified separator mechanism to concatenate multiple images into an ultra-long sequence while preserving their original resolution. This enables autoregressive pretraining with input lengths extended up to four times beyond previous methods. Built upon the Mamba architecture, STAR effectively unleashes the state space model's ability to capture long-range dependencies. Experimental results show that STAR-B achieves 83.5% top-1 accuracy on ImageNet-1K, a highly competitive result within the Vision Mamba family.
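
The separator mechanism lends itself to a compact sketch. Below is a minimal, hypothetical PyTorch illustration of the idea as summarized above: a single learnable separator embedding is prepended to each image's patch tokens, and several full-resolution images are then concatenated into one long causal sequence. All names (`SeparatorConcat`, `sep_token`, `patch_embed`) and the hyperparameters are assumptions for illustration, not the authors' released code.

```python
import torch
import torch.nn as nn

class SeparatorConcat(nn.Module):
    """Prepend one shared separator token to each image and concatenate."""

    def __init__(self, embed_dim=192, patch_size=16, in_chans=3):
        super().__init__()
        # A single learnable embedding, reused as the separator for every image.
        self.sep_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        # Standard patch embedding: non-overlapping patches -> token embeddings.
        self.patch_embed = nn.Conv2d(in_chans, embed_dim,
                                     kernel_size=patch_size, stride=patch_size)

    def forward(self, images):
        # images: (B, N, C, H, W) with N full-resolution images per sample.
        B, N, C, H, W = images.shape
        x = self.patch_embed(images.flatten(0, 1))   # (B*N, D, H/p, W/p)
        x = x.flatten(2).transpose(1, 2)             # (B*N, L, D) patch tokens
        x = x.reshape(B, N, -1, x.size(-1))          # (B, N, L, D)
        sep = self.sep_token.expand(B, N, 1, -1)     # identical separator per image
        # [sep, img1 tokens, sep, img2 tokens, ...] as one long causal sequence.
        return torch.cat([sep, x], dim=2).flatten(1, 2)  # (B, N*(L+1), D)
```

Under these assumptions, four 224×224 images with a patch size of 16 each contribute 196 patch tokens plus one separator, so the concatenated sequence has 4 × 197 = 788 tokens: roughly quadruple the length of single-image pretraining while every image keeps its original resolution.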

πŸ“ Abstract
The state space model Mamba has recently emerged as a promising paradigm in computer vision, attracting significant attention due to its efficient processing of long-sequence tasks. Mamba's inherent causal mechanism renders it particularly suitable for autoregressive pretraining. However, current autoregressive pretraining methods are constrained to short sequences, failing to fully exploit Mamba's prowess in handling extended sequences. To address this limitation, we introduce an innovative autoregressive pretraining method for Vision Mamba that substantially extends the input sequence length. We introduce new SeparaTors for AutoRegressive pretraining, known as STAR, to demarcate and differentiate between different images. Specifically, we insert identical separators before each image to mark its inception. This strategy enables us to quadruple the input sequence length of Vision Mamba while preserving the original dimensions of the dataset images. Employing this long-sequence pretraining technique, our STAR-B model achieves an accuracy of 83.5% on ImageNet-1k, which is highly competitive among Vision Mamba models. These results underscore the potential of our method in enhancing the performance of vision models through improved leveraging of long-range dependencies.
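
Pretraining then reduces to next-token prediction over the concatenated sequence, sketched below under stated assumptions: `mamba_backbone` stands in for any causal Vision Mamba encoder, `head` for a lightweight prediction head, and the mean-squared-error regression target on patch embeddings is an illustrative choice rather than the paper's exact objective.

```python
import torch.nn.functional as F

def autoregressive_loss(mamba_backbone, head, seq):
    # seq: (B, T, D) separator-delimited token sequence (e.g. from SeparatorConcat).
    hidden = mamba_backbone(seq)   # causal model: position t only sees tokens <= t
    pred = head(hidden[:, :-1])    # predict token t+1 from the prefix up to t
    target = seq[:, 1:].detach()   # shifted targets: the actual next tokens
    return F.mse_loss(pred, target)
```

Because the same separator precedes every image, the model can learn that this token resets the context, which is what lets many independent images share one ultra-long sequence.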
Problem

Research questions and friction points this paper is trying to address.

autoregressive pretraining
long sequence
Vision Mamba
state space model
sequence length limitation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Mamba
autoregressive pretraining
sequence length extension
separators
long-range dependencies
πŸ”Ž Similar Papers
No similar papers found.
👥 Authors
Hanpeng Liu
Huazhong University of Science and Technology
Zidan Wang
Huazhong University of Science and Technology
Shuoxi Zhang
Huazhong University of Science and Technology
Kaiyuan Gao
Huazhong University of Science and Technology
Visual Generation · AI4Science
Kun He
Professor, Huazhong University of Science and Technology
AI Security · Graph data mining · Optimization · Deep learning · AI4Science