🤖 AI Summary
This work addresses two key challenges in data-free quantization (DFQ) of Vision Mamba Models (VMMs): (1) semantic impoverishment of synthesized data due to state recurrence, and (2) dynamic activation outliers across time steps. We propose the first DFQ framework tailored for VMMs. Methodologically, it adopts a two-stage paradigm: (1) semantic-aware synthetic data generation via latent-space neighborhood contrastive learning; and (2) lightweight, time-step-wise adaptive outlier channel selection, integrated with mixed-precision quantization and custom GPU kernel optimization. Experiments demonstrate that our method surpasses data-driven post-training quantization (PTQ) baselines across multiple vision tasks, achieving state-of-the-art quantization accuracy. Moreover, it delivers up to 2.36× measured inference speedup on hardware, validating both its efficacy and efficiency.
📝 Abstract
We present OuroMamba, the first data-free post-training quantization (DFQ) method for Vision Mamba-based models (VMMs). We identify two key challenges in enabling DFQ for VMMs: (1) VMMs' recurrent state transitions restrict the capture of long-range interactions, leading to semantically weak synthetic data; (2) VMM activations exhibit dynamic outlier variations across time-steps, rendering existing static PTQ techniques ineffective. To address these challenges, OuroMamba presents a two-stage framework: (1) OuroMamba-Gen generates semantically rich and meaningful synthetic data by applying contrastive learning on patch-level VMM features produced through neighborhood interactions in the latent state space; (2) OuroMamba-Quant employs mixed-precision quantization with lightweight dynamic outlier detection during inference. Specifically, we present a threshold-based outlier channel selection strategy for activations that is updated every time-step. Extensive experiments across vision and generative tasks show that our data-free OuroMamba surpasses existing data-driven PTQ techniques, achieving state-of-the-art performance across diverse quantization settings. Additionally, we implement efficient GPU kernels to achieve a practical latency speedup of up to 2.36×. Code will be released soon.
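The dynamic outlier handling described above can be pictured as follows. This is a minimal NumPy sketch, not the paper's implementation: the specific selection rule (flagging channels whose peak magnitude exceeds a multiple of the median peak) and the bit-widths are illustrative assumptions; the paper's actual thresholding criterion and kernel-level details may differ. The key idea it demonstrates is that the outlier-channel mask is recomputed at every time-step, so outlier channels can be quantized at higher precision while the remaining channels stay at low precision.

```python
import numpy as np

def select_outlier_channels(x, k=3.0):
    """Flag channels whose peak magnitude exceeds k times the median
    per-channel peak at this time-step (hypothetical selection rule).
    x has shape (tokens, channels); returns a boolean mask of shape (channels,)."""
    peaks = np.abs(x).max(axis=0)
    return peaks > k * np.median(peaks)

def quantize_uniform(x, bits):
    """Symmetric uniform fake-quantization to the given bit-width."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = np.abs(x).max()
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return np.round(x / scale).clip(-qmax, qmax) * scale

def mixed_precision_step(x, low_bits=4, high_bits=8, k=3.0):
    """Quantize one time-step's activations: outlier channels at
    high_bits, all others at low_bits. The mask is recomputed per
    time-step, so outlier channels can change as the state evolves."""
    mask = select_outlier_channels(x, k)
    out = np.empty_like(x, dtype=np.float64)
    out[:, mask] = quantize_uniform(x[:, mask], high_bits)
    out[:, ~mask] = quantize_uniform(x[:, ~mask], low_bits)
    return out, mask
```

Calling `mixed_precision_step` once per time-step mimics the dynamic behavior: a channel that spikes at step t is caught by the fresh mask at step t, rather than by a static calibration-time mask.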