🤖 AI Summary
This work addresses two key challenges in data-free quantization (DFQ) of Vision Mamba Models (VMMs): (1) semantic impoverishment of synthesized data due to state recurrence, and (2) dynamic activation outliers across time steps. We propose the first DFQ framework tailored for VMMs. Methodologically, it adopts a two-stage paradigm: (1) semantic-aware synthetic data generation via latent-space neighborhood contrastive learning; and (2) lightweight, time-step-wise adaptive outlier channel selection, integrated with mixed-precision quantization and custom GPU kernel optimization. Experiments demonstrate that our method surpasses data-driven post-training quantization (PTQ) baselines across multiple vision tasks, achieving state-of-the-art quantization accuracy. Moreover, it delivers up to 2.36× measured inference speedup on hardware, validating both its efficacy and efficiency.
📝 Abstract
We present OuroMamba, the first data-free post-training quantization (DFQ) method for Vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs: (1) VMMs' recurrent state transitions restrict the capture of long-range interactions, leading to semantically weak synthetic data; (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen generates semantically rich and meaningful synthetic data by applying contrastive learning on patch-level VMM features produced through neighborhood interactions in the latent state space; (2) OuroMamba-Quant employs mixed-precision quantization with lightweight dynamic outlier detection during inference. Specifically, we present a threshold-based outlier channel selection strategy for activations that is updated every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels to achieve a practical latency speedup of up to 2.36×. Code will be released soon.
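The dynamic outlier handling described above can be pictured as follows. This is a minimal NumPy sketch, not the paper's implementation: the specific selection rule (flagging channels whose peak magnitude exceeds a multiple of the median peak) and the bit-widths are illustrative assumptions; the paper's actual thresholding criterion and kernel-level details may differ. The key idea it demonstrates is that the outlier-channel mask is recomputed at every time-step, so outlier channels can be quantized at higher precision while the remaining channels stay at low precision.

```python
import numpy as np

def select_outlier_channels(x, k=3.0):
    """Flag channels whose peak magnitude exceeds k times the median
    per-channel peak at this time-step (hypothetical selection rule).
    x has shape (tokens, channels); returns a boolean mask of shape (channels,)."""
    peaks = np.abs(x).max(axis=0)
    return peaks > k * np.median(peaks)

def quantize_uniform(x, bits):
    """Symmetric uniform fake-quantization to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(x).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale

def mixed_precision_step(x, low_bits=4, high_bits=8, k=3.0):
    """Quantize one time-step's activations: outlier channels at
    high_bits, all others at low_bits. The mask is recomputed per
    time-step, so outlier channels can change as the state evolves."""
    mask = select_outlier_channels(x, k)
    out = np.empty_like(x, dtype=np.float64)
    out[:, mask] = quantize_uniform(x[:, mask], high_bits)
    out[:, ~mask] = quantize_uniform(x[:, ~mask], low_bits)
    return out, mask
```

Calling `mixed_precision_step` once per time-step mimics the dynamic behavior: a channel that spikes at step t is caught by the fresh mask at step t, rather than by a static calibration-time mask.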