StableMamba: Distillation-free Scaling of Large SSMs for Images and Videos

📅 2024-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses inefficient parameter scaling in large-scale State Space Models (SSMs) for image classification and action recognition. The authors propose a Mamba-Attention interleaved architecture, the first to overcome Mamba's intrinsic scaling bottleneck without relying on knowledge distillation. The approach combines an enhanced S6 selective-scan SSM with self-attention blocks via interleaved stacking, strengthening global contextual modeling, and adds a stability-aware training strategy to improve robustness. Evaluated on ImageNet-1K, Kinetics-400, and Something-Something-v2, the method achieves up to 1.7 percentage points higher top-1 accuracy than state-of-the-art Mamba-based models, while improving scalability, generalization, and robustness to common distortions such as JPEG compression.

📝 Abstract
State-space models (SSMs), exemplified by S4, have introduced a novel context modeling method by integrating state-space techniques into deep learning. However, they struggle with global context modeling due to their data-independent matrices. The Mamba model addressed this with data-dependent variants via the S6 selective-scan algorithm, enhancing context modeling, especially for long sequences. However, Mamba-based architectures are difficult to scale with respect to the number of parameters, which is a major limitation for vision applications. This paper addresses the scalability issue of large SSMs for image classification and action recognition without requiring additional techniques like knowledge distillation. We analyze the distinct characteristics of Mamba-based and Attention-based models, proposing a Mamba-Attention interleaved architecture that enhances scalability, robustness, and performance. We demonstrate that the stable and efficient interleaved architecture resolves the scalability issue of Mamba-based architectures for images and videos and increases robustness to common artifacts like JPEG compression. Our thorough evaluation on the ImageNet-1K, Kinetics-400, and Something-Something-v2 benchmarks demonstrates that our approach improves the top-1 accuracy of state-of-the-art Mamba-based architectures by up to $+1.7$ percentage points.
Problem

Research questions and friction points this paper is trying to address.

Scaling Mamba-based SSMs for vision tasks without distillation
Improving global context modeling in SSMs for images/videos
Enhancing robustness to artifacts like JPEG compression
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mamba-Attention interleaved architecture enhances scalability
Data-dependent S6 algorithm improves context modeling
No distillation needed for large SSM scaling
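The two core ideas above can be sketched in a few lines of plain Python. Note this is an illustrative toy, not the paper's implementation: the scalar `selective_scan` only mimics the *data-dependent* gating that distinguishes S6 from S4's fixed matrices, and the attention/Mamba interleaving ratio in `interleaved_schedule` (its `period` parameter) is an assumption, not the paper's reported layer placement.

```python
import math

def selective_scan(xs, decay_w=0.5, input_w=1.0):
    """Toy scalar analogue of an S6-style selective scan.

    Unlike S4, where the state transition is fixed, here the decay `a`
    and input gate `b` are recomputed from each input x_t, i.e. the
    recurrence h_t = a(x_t) * h_{t-1} + b(x_t) * x_t is data-dependent.
    `decay_w` and `input_w` are hypothetical gating weights.
    """
    h, ys = 0.0, []
    for x in xs:
        a = math.exp(-abs(decay_w * x))  # data-dependent decay in (0, 1]
        b = math.tanh(input_w * x)       # data-dependent input gate
        h = a * h + b * x                # selective state update
        ys.append(h)                     # readout with C = 1 for simplicity
    return ys

def interleaved_schedule(depth, period=3):
    """Layer schedule for a Mamba-Attention interleaved stack: every
    `period`-th block is self-attention, the rest are Mamba (SSM) blocks.
    The 2:1 default ratio is an illustrative assumption."""
    return ["attn" if (i + 1) % period == 0 else "mamba" for i in range(depth)]
```

For example, `interleaved_schedule(6)` yields `["mamba", "mamba", "attn", "mamba", "mamba", "attn"]`, showing how periodic attention blocks give an otherwise recurrent SSM stack direct global token-to-token interactions.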