🤖 AI Summary
Large multilingual speech translation models often carry excessive parameter counts, making it difficult to achieve high inference efficiency and high translation quality at the same time. To address this, we propose a parasitic dual-scale modeling paradigm centered on the Key-Value Sparse Prediction Network (KVSPN), integrated with enhanced speculative decoding, structured pruning, and knowledge distillation. Our method significantly accelerates inference without compromising accuracy: KVSPN alone achieves a 40% speedup, while the full pipeline, including distillation, yields a 2.6× inference acceleration over Whisper Medium with superior BLEU and TER scores. Evaluated across six major languages, our approach establishes new state-of-the-art results in both translation quality and latency, and is the first work to jointly optimize performance and efficiency in multilingual speech translation. This enables practical, cost-effective on-device deployment.
📝 Abstract
Recent advancements in speech-to-text translation have led to the development of multilingual models capable of handling multiple language pairs simultaneously. However, these unified models often suffer from large parameter sizes, making it challenging to balance inference efficiency and performance, particularly in local deployment scenarios. We propose an innovative Parasitic Dual-Scale Approach, which combines an enhanced speculative sampling method with model compression and knowledge distillation techniques. Building on the Whisper Medium model, we adapt it for multilingual speech translation as whisperM2M and integrate our novel KVSPN module, achieving state-of-the-art (SOTA) performance across six popular languages with improved inference efficiency. KVSPN enables a 40% speedup with no BLEU score degradation; combined with distillation methods, it achieves a 2.6× speedup over the original Whisper Medium with superior performance.
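The abstract's speedup comes from speculative decoding: a cheap draft model proposes several tokens, and the large target model verifies them in a single pass, accepting the longest agreeing prefix. The sketch below illustrates only the generic greedy variant of this idea, not the paper's KVSPN module; the two toy deterministic next-token tables stand in for real draft and target models and are purely hypothetical.

```python
# Generic greedy speculative decoding sketch (illustration only, NOT the
# paper's KVSPN method). A toy draft model proposes k tokens; the toy target
# model verifies them; the longest agreeing prefix is accepted, plus one
# corrected token at the first disagreement.

def draft_next(token):
    # Cheap draft model (hypothetical deterministic next-token table).
    return {0: 1, 1: 2, 2: 3, 3: 0}.get(token, 0)

def target_next(token):
    # Accurate target model (hypothetical); disagrees with the draft at 2.
    return {0: 1, 1: 2, 2: 9, 3: 0, 9: 0}.get(token, 0)

def speculative_step(prefix, k=4):
    """Propose k draft tokens, verify against the target, return accepted tokens."""
    # 1) Draft model autoregressively proposes k candidate tokens.
    drafts, tok = [], prefix[-1]
    for _ in range(k):
        tok = draft_next(tok)
        drafts.append(tok)
    # 2) Target model scores the k positions (one batched pass in practice).
    accepted, tok = [], prefix[-1]
    for d in drafts:
        t = target_next(tok)
        if t == d:              # draft agrees with target: accept and continue
            accepted.append(d)
            tok = d
        else:                   # first disagreement: emit target's token, stop
            accepted.append(t)
            break
    return accepted

print(speculative_step([0]))    # accepts two draft tokens, then corrects
```

Because verification runs once over all drafted positions, several tokens can be emitted per target-model pass, which is where the latency savings originate; the paper's KVSPN refines the drafting side of this loop.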