🤖 AI Summary
To address the deployment challenges and high latency of large speech generation models, this paper proposes the TinyWave model family and a layer-aligned knowledge distillation framework, enabling compact models to handle both speech-only and interleaved speech-text continuation. Methodologically, the authors compress a large multimodal Transformer via hidden-state matching, attention-map alignment, and softened-logits distillation, yielding 2B-parameter models with roughly one-third of the teacher's parameter count. Trained on 50,000 hours of public audio, the distilled models stay within 1.4 normalized perplexity points of the teacher on Libri-Light and reach 93–97% of the teacher's accuracy on spoken StoryCloze and SALMon, outperforming size-matched baselines. This work enables efficient, real-time speech generation for resource-constrained and interactive dialogue scenarios.
📝 Abstract
Current speech language models exceed the size and latency constraints of many deployment environments. We build compact, expressive speech generation models through layer-aligned distillation, matching hidden states, attention maps, and softened logits to compress large multimodal transformers by 3x with minimal loss in performance. We introduce TinyWave, a family of 2B-parameter models for speech-to-speech and interleaved speech-text generation, trained on 50,000 hours of public audio. TinyWave supports (i) speech-only generation using phonetic or expressive tokens and (ii) mixed speech-text continuations. Evaluation on Libri-Light shows TinyWave within 1.4 normalized perplexity points of its teacher. Accuracy on spoken StoryCloze and SALMon reaches 93-97% of the teacher's performance, outperforming size-matched baselines. These models are optimized for deployment on commodity hardware, enabling applications in real-time conversational agents, assistive technologies, and low-resource environments. We release models, training code, and evaluation scripts to support reproducible research on compact, expressive speech generation.
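The three distillation signals named in the abstract (hidden-state matching, attention-map alignment, and softened-logits distillation) can be combined into a single weighted training loss. The sketch below is an illustrative reconstruction, not the paper's released code: the layer mapping, loss weights `w_hidden`/`w_attn`/`w_logits`, and temperature `T` are assumptions for the example, and a real implementation would operate on framework tensors with gradients rather than numpy arrays.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def distillation_loss(student_h, teacher_h,
                      student_attn, teacher_attn,
                      student_logits, teacher_logits,
                      T=2.0, w_hidden=1.0, w_attn=1.0, w_logits=1.0):
    """Layer-aligned distillation loss (illustrative sketch).

    Inputs are assumed to be already aligned: e.g. if the student has a
    third of the teacher's layers, student layer i is paired with teacher
    layer 3i (the exact mapping is an assumption, not from the paper).
    """
    eps = 1e-9
    # 1) Hidden-state matching: MSE between paired layer activations.
    l_hidden = np.mean((student_h - teacher_h) ** 2)
    # 2) Attention-map alignment: KL divergence between each teacher
    #    attention row and the corresponding student row.
    l_attn = np.mean(np.sum(
        teacher_attn * (np.log(teacher_attn + eps) - np.log(student_attn + eps)),
        axis=-1))
    # 3) Softened-logits distillation: KL at temperature T, scaled by T^2
    #    (Hinton-style) so the gradient magnitude is comparable across T.
    p_t = softmax(teacher_logits / T)
    p_s = softmax(student_logits / T)
    l_logits = (T ** 2) * np.mean(np.sum(
        p_t * (np.log(p_t + eps) - np.log(p_s + eps)), axis=-1))
    return w_hidden * l_hidden + w_attn * l_attn + w_logits * l_logits
```

When student and teacher agree exactly, all three terms vanish; any mismatch in activations, attention patterns, or output distributions contributes a positive penalty.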