Training and Inference Efficiency of Encoder-Decoder Speech Models

📅 2025-03-07
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses efficiency bottlenecks in the training and inference of mainstream encoder-decoder speech models (e.g., Whisper, Seamless, Canary-1B), identifying two root causes: padding waste induced by naive mini-batch sampling of sequential data, and the computational cost of autoregressive decoding. The authors show that careless mini-batch sampling spends more than 50% of computation on padding; by profiling GPU utilization and optimizing the sampling strategy for Canary-1B training, they achieve a 5× increase in average batch size, which cuts training wall time in half under the original compute budget or reduces GPU requirements to one quarter at the same wall time. For inference, reallocating model parameters from the decoder to the encoder yields a 3× speedup as measured by inverse real-time factor (RTFx) while preserving accuracy and convergence compute. Training code and models are released as open source.
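The padding-reduction idea can be sketched as duration-aware batch packing: sort utterances by length and cap each mini-batch by its padded cost, so short and long utterances are not mixed. This is an illustrative sketch only; the function names and the packing heuristic are assumptions, not the paper's actual implementation.

```python
def dynamic_length_batches(durations, max_batch_seconds=120.0):
    """Pack utterance indices into batches of similar duration.

    Sorting by duration first means the last item added to a batch is
    always its longest, so the padded cost of a batch is simply
    (longest duration) * (batch size).
    """
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    batches, current = [], []
    for idx in order:
        candidate = current + [idx]
        padded_cost = durations[candidate[-1]] * len(candidate)
        if padded_cost > max_batch_seconds and current:
            batches.append(current)  # close the batch before it overflows
            current = [idx]
        else:
            current = candidate
    if current:
        batches.append(current)
    return batches


def padding_fraction(durations, batches):
    """Fraction of per-batch compute spent on padding frames."""
    padded = sum(max(durations[i] for i in b) * len(b) for b in batches)
    real = sum(durations)
    return 1.0 - real / padded
```

With mixed-length data, the padding fraction of duration-sorted batches is far below the >50% waste the paper reports for fixed-length sampling, because each batch pads only up to its own longest utterance.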

📝 Abstract
Attention encoder-decoder model architecture is the backbone of several recent top performing foundation speech models: Whisper, Seamless, OWSM, and Canary-1B. However, the reported data and compute requirements for their training are prohibitive for many in the research community. In this work, we focus on the efficiency angle and ask the questions of whether we are training these speech models efficiently, and what can we do to improve? We argue that a major, if not the most severe, detrimental factor for training efficiency is related to the sampling strategy of sequential data. We show that negligence in mini-batch sampling leads to more than 50% computation being spent on padding. To that end, we study, profile, and optimize Canary-1B training to show gradual improvement in GPU utilization leading up to 5x increase in average batch sizes versus its original training settings. This in turn allows us to train an equivalent model using 4x less GPUs in the same wall time, or leverage the original resources and train it in 2x shorter wall time. Finally, we observe that the major inference bottleneck lies in the autoregressive decoder steps. We find that adjusting the model architecture to transfer model parameters from the decoder to the encoder results in a 3x inference speedup as measured by inverse real-time factor (RTFx) while preserving the accuracy and compute requirements for convergence. The training code and models will be available as open-source.
Problem

Research questions and friction points this paper is trying to address.

Improving training efficiency of encoder-decoder speech models.
Reducing computation waste due to poor mini-batch sampling.
Optimizing inference speed by adjusting model architecture.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Duration-aware mini-batch sampling cuts the >50% of computation previously wasted on padding.
Profiling-driven improvements in GPU utilization enable a 5× increase in average batch size.
Reallocating parameters from the decoder to the encoder speeds up inference 3× (RTFx) without hurting accuracy.
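The inference metric behind the 3× claim, inverse real-time factor (RTFx), is simply audio duration divided by wall-clock processing time; higher means faster. A minimal sketch (the function name is illustrative):

```python
def rtfx(audio_seconds, wall_clock_seconds):
    """Inverse real-time factor: seconds of audio processed per
    second of compute. RTFx = 100 means 1 hour of audio is
    transcribed in 36 seconds."""
    return audio_seconds / wall_clock_seconds
```

Under this metric, moving parameters from the autoregressive decoder (run once per output token) to the encoder (run once per utterance) raises RTFx because the per-token cost dominates total inference time.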