🤖 AI Summary
To enable efficient one-shot compression of speech foundation models, this paper proposes a single-stage joint optimization framework that performs neuron-level fine-grained pruning and parameter fine-tuning through synchronous, first-order, end-to-end training. The key contributions are: (1) a self-pinching gate mechanism with layer-level weight tying, in which a single learnable threshold per gate yields precise, controllable sparsity; and (2) the integration of sparsity-aware gating, hierarchical weight reuse, and fine-grained pruning into one optimization pass. Evaluated on wav2vec 2.0-base and HuBERT-large, the method reduces parameters by 65% and 60%, respectively, while holding test-clean WER at 7.05% with no statistically significant degradation, and cuts compression time by over 25%. To the authors' knowledge, this is the first approach to fully unify pruning and fine-tuning in a single-stage, end-to-end optimization, achieving a favorable trade-off among efficiency, accuracy, and deployment practicality.
📝 Abstract
This paper presents a novel approach to speech foundation model compression that tightly integrates model pruning and parameter updates into a single stage. Highly compact, layer-level tied self-pinching gates, each containing only a single learnable threshold, are jointly trained with the uncompressed models and used for fine-grained, neuron-level pruning. Experiments conducted on the LibriSpeech-100hr corpus suggest that our approach reduces the number of parameters of the wav2vec2.0-base and HuBERT-large models by 65% and 60%, respectively, while incurring no statistically significant word error rate (WER) increase on the test-clean dataset. Compared to previously published methods on the same task, our approach not only achieves the lowest WER of 7.05% on test-clean under a comparable compression ratio of 4.26x, but also requires at least 25% less model compression time.
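As a rough illustration of the idea (not the paper's exact formulation), a layer-tied self-pinching gate can be sketched as a differentiable mask over a layer's neurons, controlled by a single shared threshold: each neuron gets an importance score, and a sigmoid relaxation pushes sub-threshold neurons toward zero so they can be pruned after joint training. The function name, the L2-norm importance score, and the `temperature` parameter below are all assumptions for illustration.

```python
import numpy as np

def self_pinching_gate(weights, threshold, temperature=10.0):
    """Sketch of a neuron-level gate with one learnable threshold per layer.

    weights:   (out_neurons, in_features) layer weight matrix
    threshold: scalar parameter shared (tied) across the whole layer
    Returns a soft mask in [0, 1], one value per output neuron. Neurons
    whose importance falls below the threshold are pushed toward 0 and
    can be removed when the compressed model is exported.
    """
    # Assumed importance score: L2 norm of each neuron's incoming weights.
    importance = np.linalg.norm(weights, axis=1)
    # Sigmoid relaxation keeps the mask differentiable w.r.t. the threshold,
    # which is what allows pruning and fine-tuning to be trained jointly
    # in a single first-order, end-to-end stage.
    return 1.0 / (1.0 + np.exp(-temperature * (importance - threshold)))

# Example: a toy layer with 4 output neurons and a shared threshold.
rng = np.random.default_rng(0)
W = rng.normal(size=(4, 16))
mask = self_pinching_gate(W, threshold=4.0)
pruned = mask < 0.5  # neurons to drop at export time
```

Because each gate adds only one scalar per layer, the gating overhead is negligible relative to the model being compressed, which is consistent with the "highly compact" claim in the abstract.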