Compressing Transformer-based self-supervised models for speech processing

📅 2022-11-17
🏛️ arXiv.org
📈 Citations: 6
Influential: 0
🤖 AI Summary
Large-scale self-supervised speech Transformer models carry high parameter counts and computational overhead, hindering practical deployment; moreover, prior compression studies use inconsistent settings and metrics, impeding fair comparison. This work systematically evaluates commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation, and reports trade-offs at various compression rates in terms of wall-clock time, parameter count, and multiply-accumulate operations (MACs). The results show that basic compression techniques are strong baselines relative to recent approaches, reveal properties of Transformers such as the significance of diagonal attention heads, and lead to a simple combination of techniques that improves the accuracy-efficiency trade-off. The authors release their code, promoting more diverse comparisons among compression techniques and the use of compression as a tool for analyzing models.
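The summary highlights the significance of diagonal attention heads, i.e., heads whose attention mass concentrates near the main diagonal (each frame attending mostly to its neighbors). The paper does not give a formula here, but a simple way to quantify this, sketched below under the assumption that a head is "diagonal" when most of its attention mass falls within a small band around the diagonal (`diagonality` and `band` are illustrative names, not from the paper):

```python
import numpy as np

def diagonality(attn: np.ndarray, band: int = 1) -> float:
    """Fraction of attention mass within `band` positions of the diagonal.

    attn: (T, T) attention matrix for one head (rows sum to 1).
    """
    T = attn.shape[0]
    # Boolean mask selecting entries (i, j) with |i - j| <= band.
    near_diag = np.abs(np.subtract.outer(np.arange(T), np.arange(T))) <= band
    return float(attn[near_diag].sum() / attn.sum())

# A perfectly diagonal head scores 1.0; uniform attention scores (2*band+1)/T on average.
identity_head = np.eye(8)
uniform_head = np.full((8, 8), 1 / 8)
print(diagonality(identity_head))            # 1.0
print(round(diagonality(uniform_head), 3))   # 0.344
```

A score like this could rank heads by how replaceable they are with a fixed local-attention pattern, which is one way a diagnostic of this kind feeds into head pruning.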
📝 Abstract
Despite the success of Transformers in self-supervised learning with applications to various downstream tasks, the computational cost of training and inference remains a major challenge for applying these models to a wide spectrum of devices. Several isolated attempts have been made to compress Transformers, but the settings and metrics differ across studies. Trade-offs at various compression rates are also largely missing in prior work, making it difficult to compare compression techniques. In this work, we aim to provide context for the isolated results, studying several commonly used compression techniques, including weight pruning, head pruning, low-rank approximation, and knowledge distillation. We report trade-offs at various compression rates, including wall-clock time, the number of parameters, and the number of multiply-accumulate operations. Our results show that, compared to recent approaches, basic compression techniques are strong baselines. We further present several applications of our results, revealing properties of Transformers, such as the significance of diagonal attention heads. In addition, our results lead to a simple combination of compression techniques that improves trade-offs over recent approaches. We hope the results will promote more diverse comparisons among model compression techniques and promote the use of model compression as a tool for analyzing models. Our code for compressing speech self-supervised models is available at https://github.com/nervjack2/Speech-SSL-Compression/.
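Among the techniques the abstract lists, low-rank approximation is the most mechanical: a dense weight matrix is replaced by the product of two thin factors, typically obtained from a truncated SVD. A minimal sketch, assuming typical Transformer feed-forward dimensions (768 × 3072; the function name and shapes are illustrative, not the paper's implementation):

```python
import numpy as np

def low_rank_factorize(W: np.ndarray, rank: int):
    """Approximate a dense weight W (d_out, d_in) by two factors A @ B of the given rank."""
    U, S, Vt = np.linalg.svd(W, full_matrices=False)
    A = U[:, :rank] * S[:rank]   # (d_out, rank), singular values folded into A
    B = Vt[:rank, :]             # (rank, d_in)
    return A, B

rng = np.random.default_rng(0)
W = rng.standard_normal((768, 3072))
A, B = low_rank_factorize(W, rank=128)
# Parameter count drops from d_out*d_in to rank*(d_out + d_in).
print(W.size, A.size + B.size)  # 2359296 491520
```

At inference time the single matrix multiply `x @ W.T` becomes two smaller ones, `(x @ B.T) @ A.T`, which is where the MAC savings come from.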
Problem

Research questions and friction points this paper is trying to address.

Evaluating compression methods for self-supervised speech Transformers
Comparing practical effectiveness using consistent metrics
Providing deployment guidance for compressed speech models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Comprehensive study of four compression methods: weight pruning, head pruning, low-rank approximation, and knowledge distillation
Evaluation by wall-clock time, parameter count, and multiply-accumulate operations (MACs)
Comparison against recent compression techniques at matched compression rates
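Of the four methods studied, weight pruning is the simplest baseline: zero out the smallest-magnitude weights up to a target sparsity. A hedged sketch of global magnitude pruning (the function name and tie-breaking behavior are illustrative, not the paper's exact procedure):

```python
import numpy as np

def magnitude_prune(W: np.ndarray, sparsity: float) -> np.ndarray:
    """Zero out the smallest-magnitude fraction `sparsity` of the weights in W."""
    k = int(W.size * sparsity)
    if k == 0:
        return W.copy()
    # k-th smallest absolute value serves as the pruning threshold.
    thresh = np.partition(np.abs(W).ravel(), k - 1)[k - 1]
    mask = np.abs(W) > thresh   # ties at the threshold are also pruned
    return W * mask

W = np.arange(1.0, 11.0)        # magnitudes 1..10
pruned = magnitude_prune(W, 0.5)
print(pruned)                    # the five smallest magnitudes are zeroed
```

Note that unstructured pruning like this reduces parameter count but not necessarily wall-clock time on dense hardware, which is one reason the paper reports all three metrics rather than parameters alone.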
👥 Authors

Tzu-Quan Lin, National Taiwan University
Tsung-Huan Yang, Academia Sinica, Taiwan
Chun-Yao Chang, University of California, Los Angeles, United States
Kuang-Ming Chen, University of Washington
Tzu-hsun Feng, Graduate Institute of Communication Engineering, National Taiwan University, Taiwan
Hung-yi Lee, National Taiwan University
Hao Tang, University of Edinburgh, United Kingdom