Joint Fine-tuning and Conversion of Pretrained Speech and Language Models towards Linear Complexity

📅 2024-10-09
🏛️ arXiv.org
📈 Citations: 0
Influential: 0

🤖 AI Summary
Problem: The quadratic computational complexity of Transformers makes inference costly, particularly in non-text domains such as speech, where large pretrained linear-complexity models are often unavailable. Method: This paper proposes Cross-Architecture Layerwise Distillation (CALD), which jointly converts a pretrained Transformer into a linear-complexity model (e.g., Linformer or Mamba) and fine-tunes it on a target task. Several strategies for guiding the fine-tuning are compared; they differ in how they use the target model and in the trajectory of the parameters. Contribution/Results: Experiments on language processing, language modeling, and speech processing show that CALD can effectively recover the original Transformer's performance, and that the choice of guidance strategy contributes to the result. The work offers a cross-architecture distillation approach for making large pretrained models lighter beyond text.
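As a rough illustration of the complexity gap described here, the sketch below counts multiply-accumulate operations for the attention score computation in standard self-attention and in a Linformer-style variant that first projects the keys from sequence length n down to a fixed length k. The function names and the example sizes are assumptions chosen for illustration; they are not taken from the paper.

```python
def full_attention_score_macs(n, d):
    """Multiply-accumulates for Q @ K^T with one head.

    Produces an n x n score matrix; each entry is a length-d dot product.
    """
    return n * n * d


def linformer_score_macs(n, d, k):
    """Multiply-accumulates for a Linformer-style score computation.

    Keys (n x d) are first projected to (k x d) with a (k x n) matrix,
    then queries (n x d) are scored against the k projected keys.
    """
    projection = k * n * d  # (k x n) @ (n x d)
    scores = n * k * d      # (n x d) @ (d x k)
    return projection + scores


if __name__ == "__main__":
    d, k = 64, 256  # hypothetical head dimension and projected length
    for n in (1_000, 2_000, 4_000):
        full = full_attention_score_macs(n, d)
        lin = linformer_score_macs(n, d, k)
        print(f"n={n}: full={full}, linformer-style={lin}")
```

Doubling n quadruples the first count but only doubles the second, which is why long inputs such as speech frames are where the quadratic term hurts most.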

📝 Abstract
Architectures such as Linformer and Mamba have recently emerged as competitive linear time replacements for transformers. However, corresponding large pretrained models are often unavailable, especially in non-text domains. To remedy this, we present a Cross-Architecture Layerwise Distillation (CALD) approach that jointly converts a transformer model to a linear time substitute and fine-tunes it to a target task. We also compare several means to guide the fine-tuning to optimally retain the desired inference capability from the original model. The methods differ in their use of the target model and the trajectory of the parameters. In a series of empirical studies on language processing, language modeling, and speech processing, we show that CALD can effectively recover the result of the original model, and that the guiding strategy contributes to the result. Some reasons for the variation are suggested.
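The idea of combining a task loss with a layerwise matching term can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's objective: the abstract does not give the loss, so the mean-squared-error matching, the one-to-one layer pairing, the weight `alpha`, and all function names here are hypothetical.

```python
def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def layerwise_distillation_loss(teacher_states, student_states):
    """Average per-layer MSE between teacher and student hidden states.

    Assumes (hypothetically) that layer i of the student is paired with
    layer i of the teacher and that both emit vectors of the same size.
    """
    if len(teacher_states) != len(student_states):
        raise ValueError("teacher and student must have the same layer count")
    per_layer = [mse(t, s) for t, s in zip(teacher_states, student_states)]
    return sum(per_layer) / len(per_layer)


def joint_loss(task_loss, teacher_states, student_states, alpha=0.5):
    """Task loss plus a weighted layerwise matching term (alpha is assumed)."""
    return task_loss + alpha * layerwise_distillation_loss(
        teacher_states, student_states
    )


if __name__ == "__main__":
    # Toy hidden states: two layers, two features each.
    teacher = [[1.0, 2.0], [0.0, 0.0]]
    student = [[1.0, 0.0], [0.0, 0.0]]
    print(joint_loss(task_loss=0.5, teacher_states=teacher, student_states=student))
```

In a real setup the hidden states would come from the frozen transformer and the linear-time student on the same input batch, and the guidance strategies compared in the paper would change how the teacher signal is produced and used.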
Problem

Research questions and friction points this paper is trying to address.

Linear complexity model conversion
Cross-architecture layerwise distillation
Retaining original model inference capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Architecture Layerwise Distillation
Linear time model conversion
Fine-tuning guidance strategy