How Redundant Is the Transformer Stack in Speech Representation Models?

📅 2024-09-10
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This study systematically examines inter-layer redundancy in transformer-based speech representation models, where consecutive layers exhibit highly similar representations, quantified via cosine similarity, Centered Kernel Alignment (CKA), and Mutual Nearest Neighbors (MNN) alignment. Method: Exploiting this redundancy, the authors evaluate two compression routes that require no post-training: (1) structured layer pruning guided by representation similarity, removing up to 40% of transformer layers while retaining over 95% of predictive performance; and (2) knowledge distillation that replaces the entire transformer stack with small mimicking layers. Contribution/Results: The distillation route reduces network size by 95–98% and inference time by up to 94% without considerable performance loss on downstream tasks. The work provides empirical evidence that the transformer stack in speech representation models is largely redundant for downstream use, and offers a fine-tuning-free route to lightweight models for resource-constrained deployment.
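The similarity metrics named in the summary are standard tools. As a rough illustration of one of them, here is a minimal sketch of linear CKA between two layers' activations, assuming each is an (n_samples, n_features) NumPy array; the function name and the synthetic demo are illustrative, not the authors' code.

```python
import numpy as np

def linear_cka(X: np.ndarray, Y: np.ndarray) -> float:
    """Linear Centered Kernel Alignment between two activation matrices.

    X, Y: (n_samples, n_features) activations from two layers,
    computed on the same input samples. Returns a value in [0, 1];
    1 means identical representations up to rotation and scaling.
    """
    # Center the features so CKA is invariant to mean offsets.
    X = X - X.mean(axis=0, keepdims=True)
    Y = Y - Y.mean(axis=0, keepdims=True)
    # HSIC-based form: ||Y^T X||_F^2 / (||X^T X||_F * ||Y^T Y||_F)
    num = np.linalg.norm(Y.T @ X, ord="fro") ** 2
    den = np.linalg.norm(X.T @ X, ord="fro") * np.linalg.norm(Y.T @ Y, ord="fro")
    return float(num / den)

if __name__ == "__main__":
    # Fake per-layer activations; in practice these come from the model.
    rng = np.random.default_rng(0)
    acts = [rng.normal(size=(256, 768)) for _ in range(4)]
    # Pairwise layer similarity; blocks of high values flag redundant layers.
    sims = np.array([[linear_cka(a, b) for b in acts] for a in acts])
    print(np.round(sims, 2))
```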

📝 Abstract
Self-supervised speech representation models, particularly those leveraging transformer architectures, have demonstrated remarkable performance across various tasks such as speech recognition, speaker identification, and emotion detection. Recent studies on transformer models have revealed high redundancy between layers and the potential for significant pruning, which we investigate here for transformer-based speech representation models. We perform a detailed analysis of layer similarity in speech representation models using three similarity metrics: cosine similarity, centered kernel alignment, and mutual nearest-neighbor alignment. Our findings reveal a block-like structure of high similarity, suggesting two main processing steps and significant redundancy of layers. We demonstrate the effectiveness of pruning transformer-based speech representation models without the need for post-training, achieving up to 40% reduction in transformer layers while maintaining over 95% of the model's predictive capacity. Furthermore, we employ a knowledge distillation method to substitute the entire transformer stack with mimicking layers, reducing the network size by 95–98% and the inference time by up to 94%. This substantial decrease in computational load occurs without considerable performance loss, suggesting that the transformer stack is almost completely redundant for downstream applications of speech representation models.
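The pruning route described in the abstract needs no retraining: once the similarity analysis flags a run of near-duplicate layers, they can simply be dropped from the encoder's module list. Below is a minimal sketch, assuming a HuggingFace-style wav2vec 2.0 model whose encoder exposes its transformer layers as `encoder.layers` (an `nn.ModuleList`); the layer indices are placeholders for illustration, not the paper's selection.

```python
import torch.nn as nn
from transformers import Wav2Vec2Model

model = Wav2Vec2Model.from_pretrained("facebook/wav2vec2-base")

# Layer indices flagged as near-duplicates by the similarity analysis.
# Placeholder values for illustration, not the paper's choices.
redundant = {4, 5, 6, 7, 8}

# Drop the redundant layers in place; no post-training afterwards.
model.encoder.layers = nn.ModuleList(
    [layer for i, layer in enumerate(model.encoder.layers) if i not in redundant]
)
model.config.num_hidden_layers = len(model.encoder.layers)
```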
Problem

Research questions and friction points this paper is trying to address.

Speech Recognition
Transformer Model
Redundancy Reduction
Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer Redundancy
Knowledge Distillation
Model Optimization
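On the knowledge-distillation route, the abstract describes substituting the entire transformer stack with "mimicking layers" trained to reproduce the stack's output embeddings. The sketch below shows that idea in outline; the student architecture, dimensions, and L2 regression loss are assumptions for illustration, not the paper's exact setup.

```python
import torch
import torch.nn as nn

class MimicStudent(nn.Module):
    """Small feed-forward stack trained to reproduce the transformer's output
    embeddings from the shared (frozen) convolutional feature extractor."""

    def __init__(self, dim: int = 768, hidden: int = 1024, n_layers: int = 2):
        super().__init__()
        blocks = []
        for _ in range(n_layers):
            blocks += [nn.Linear(dim, hidden), nn.GELU(), nn.Linear(hidden, dim)]
        self.net = nn.Sequential(*blocks)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, dim) frame-level features.
        return self.net(x)

def distill_step(student, feats, teacher_out, optimizer):
    """One distillation step: regress the frozen teacher's embeddings."""
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(student(feats), teacher_out)
    loss.backward()
    optimizer.step()
    return loss.item()

# Usage sketch: teacher_out would be the full model's transformer output,
# computed once with torch.no_grad() over the training set.
```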
Teresa Dorszewski
Technical University of Denmark, DTU Compute, Section for Cognitive Systems
Albert Kjøller Jacobsen
Technical University of Denmark, DTU Compute, Section for Cognitive Systems
Lenka Tětková
Technical University of Denmark, DTU Compute, Section for Cognitive Systems
Lars Kai Hansen
Professor, Cognitive Systems, DTU Compute, Technical University of Denmark
Machine learning, AI, neuroimaging, cognitive systems, signal processing