Efficient Interleaved Speech Modeling through Knowledge Distillation

📅 2025-06-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the deployment challenges and high latency of large speech generation models, this paper proposes the TinyWave model family and a layer-wise alignment knowledge distillation framework, enabling compact models to handle both speech-only and interleaved speech-text continuation. Methodologically, the authors compress a large multimodal Transformer via hidden-state matching, attention-map alignment, and softened-logits distillation into 2-billion-parameter students, roughly one-third the teacher's size. Trained on 50,000 hours of public audio, the distilled models stay within 1.4 normalized perplexity points of the teacher on Libri-Light and reach 93–97% of the teacher's accuracy on spoken StoryCloze and SALMon, significantly outperforming size-matched baselines. This work offers a practical recipe for efficient, real-time speech generation in resource-constrained and interactive dialogue settings.

📝 Abstract
Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher's performance, outperforming size-matched baselines. These models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to support reproducible research on compact, expressive speech generation.
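The abstract's three-part objective (hidden-state matching, attention-map alignment, softened-logits distillation) can be illustrated with a minimal NumPy sketch. The per-term formulations, loss weights, and temperature below are generic knowledge-distillation conventions assumed for illustration, not the paper's exact losses; `layer_aligned_distillation_loss` and all its parameters are hypothetical names.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def layer_aligned_distillation_loss(
    t_hidden, s_hidden,    # lists of [seq, dim] hidden states from aligned layers
    t_attn, s_attn,        # lists of [heads, seq, seq] attention distributions
    t_logits, s_logits,    # [seq, vocab] output logits
    temperature=2.0,       # illustrative softening temperature
    w_hidden=1.0, w_attn=1.0, w_logits=1.0,  # illustrative loss weights
):
    eps = 1e-9

    # 1) Hidden-state matching: MSE between teacher and student states at
    #    aligned layers (assumes student states are already projected to the
    #    teacher's width).
    l_hidden = float(np.mean([np.mean((t - s) ** 2)
                              for t, s in zip(t_hidden, s_hidden)]))

    # 2) Attention-map alignment: KL(teacher || student) over each row's
    #    attention distribution, averaged across layers and heads.
    l_attn = float(np.mean([
        np.mean(np.sum(t * (np.log(t + eps) - np.log(s + eps)), axis=-1))
        for t, s in zip(t_attn, s_attn)
    ]))

    # 3) Softened-logits distillation: KL between temperature-softened
    #    teacher and student output distributions, scaled by T^2.
    p_t = softmax(t_logits / temperature)
    p_s = softmax(s_logits / temperature)
    l_logits = float(np.mean(
        np.sum(p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1)
    )) * temperature ** 2

    return w_hidden * l_hidden + w_attn * l_attn + w_logits * l_logits
```

When teacher and student activations coincide, all three terms vanish; in training, this scalar would be minimized alongside the student's usual next-token loss.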
Problem

Research questions and friction points this paper is trying to address.

Build compact speech models for constrained environments
Enable mixed speech-text generation efficiently
Optimize models for real-time hardware deployment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Layer-aligned distillation for compact models
TinyWave supports speech-text generation
Optimized for commodity hardware deployment
Mohammadmahdi Nouriborji
NLPie Research
Morteza Rohanian
University of Zurich