TransMamba: Fast Universal Architecture Adaption from Transformers to Mamba

📅 2025-02-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the high training cost and poor reusability of Transformer knowledge in linear-complexity state space models (SSMs) like Mamba, this paper proposes an efficient cross-architecture knowledge transfer framework. Methodologically, it introduces three key components: (i) a feature-aligned latent-space projection, (ii) weight subcloning with adaptive bidirectional distillation (WSAB), and (iii) a cross-Mamba language-aware module. Together, these enable robust Transformer-to-Mamba transfer despite layer-count mismatches and incorporate cross-modal features. The paper presents this as the first end-to-end knowledge transfer from Transformers to Mamba. Empirically, on image classification, visual question answering, and text-video retrieval, the transferred Mamba models surpass their from-scratch counterparts across all tasks using ≤75% of the training data, substantially reducing computational overhead while enhancing multimodal generalization.
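The weight-subcloning step summarized above must reconcile a teacher and a student network with different depths. A minimal sketch of one way to assign each student layer a teacher layer, assuming simple even spacing (the paper's actual WSAB assignment scheme may differ):

```python
def map_layers(num_teacher: int, num_student: int) -> list[int]:
    """Map each student layer to a teacher layer by even spacing,
    so weights can be sub-cloned despite differing depths.
    (Hypothetical helper; the paper's WSAB scheme may differ.)"""
    if num_student == 1:
        return [0]
    return [round(i * (num_teacher - 1) / (num_student - 1))
            for i in range(num_student)]

print(map_layers(12, 4))  # → [0, 4, 7, 11]
```

With equal depths the mapping degenerates to the identity, so sub-cloning reduces to a direct layer-by-layer copy.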

📝 Abstract
Transformers have been favored in both uni-modal and multi-modal foundation models for their flexible scalability in attention modules. Consequently, a number of pre-trained Transformer models, e.g., LLaVA, CLIP, and DEIT, are publicly available. Recent research has introduced subquadratic architectures like Mamba, which enable global awareness with linear complexity. Nevertheless, training specialized subquadratic architectures from scratch for certain tasks is both resource-intensive and time-consuming. As a motivator, we explore cross-architecture training to transfer the ready knowledge in existing Transformer models to the alternative Mamba architecture, a process termed TransMamba. Our approach employs a two-stage strategy to expedite the training of new Mamba models, ensuring effectiveness across both uni-modal and cross-modal tasks. To address architectural disparities, we project the intermediate features into an aligned latent space before transferring knowledge. On top of that, a Weight Subcloning and Adaptive Bidirectional distillation method (WSAB) is introduced for knowledge transfer without restrictions on mismatched layer counts. For cross-modal learning, we propose a cross-Mamba module that integrates language awareness into Mamba's visual features, enhancing the cross-modal interaction capabilities of the Mamba architecture. Despite using less than 75% of the training data typically required for training from scratch, TransMamba achieves substantially stronger performance across various network architectures and downstream tasks, including image classification, visual question answering, and text-video retrieval. The code will be publicly available.
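The latent-space alignment described in the abstract can be pictured as a projection of teacher features into the student's feature space followed by a distillation penalty. The linear projection, MSE loss, and weighting below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def feature_distill_loss(teacher_feats, student_feats, proj, alpha=1.0):
    """Project teacher features into the student's latent space,
    then penalize the mean squared distance to the student features.
    `proj` stands in for a learned linear projection (hypothetical)."""
    projected = teacher_feats @ proj           # (n, d_t) @ (d_t, d_s) -> (n, d_s)
    return alpha * float(np.mean((projected - student_feats) ** 2))

rng = np.random.default_rng(0)
teacher = rng.normal(size=(4, 8))              # 4 tokens, teacher dim 8
proj = rng.normal(size=(8, 6))                 # map into student dim 6
student = teacher @ proj                       # perfectly aligned student
print(feature_distill_loss(teacher, student, proj))  # → 0.0
```

In training, this loss would be minimized jointly over the student weights and the projection, so the two architectures meet in a shared latent space rather than forcing dimension-for-dimension matching.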
Problem

Research questions and friction points this paper is trying to address.

Transfer knowledge from Transformer to Mamba
Reduce training time and resource usage
Enhance cross-modal interaction in Mamba
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage training strategy
Weight Subcloning and Adaptive Bidirectional distillation
Cross-Mamba module integration
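The cross-Mamba idea listed above, injecting language awareness into visual features, can be pictured as a residual cross-attention update from visual tokens to text tokens. The attention form, shapes, and residual connection here are illustrative assumptions, not the module's published design:

```python
import numpy as np

def language_aware_update(visual, text, W_q, W_k):
    """Attend from visual tokens to text tokens and add the attended
    text back residually, giving language-conditioned visual features.
    (Hypothetical stand-in for the paper's cross-Mamba module.)"""
    q = visual @ W_q                              # (n_v, d) queries
    k = text @ W_k                                # (n_t, d) keys
    scores = q @ k.T / np.sqrt(q.shape[-1])       # scaled dot-product
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # rows sum to 1
    return visual + attn @ text                   # residual fusion

rng = np.random.default_rng(0)
v = rng.normal(size=(5, 6))                       # 5 visual tokens, dim 6
t = rng.normal(size=(3, 6))                       # 3 text tokens, dim 6
out = language_aware_update(v, t, rng.normal(size=(6, 6)), rng.normal(size=(6, 6)))
print(out.shape)  # → (5, 6)
```

The residual form keeps the output shape identical to the visual input, so such a module could be dropped between existing Mamba blocks without changing the rest of the architecture.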