Llama-Mimi: Speech Language Models with Interleaved Semantic and Acoustic Tokens

📅 2025-09-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of modeling semantic and acoustic information separately in speech language models, which can lead to audio distortion, speaker identity drift, and poor long-range coherence. To this end, we propose a framework that models interleaved semantic and acoustic tokens. Methodologically, a unified discrete tokenizer produces both token types, and a single Transformer decoder jointly models the interleaved sequence for generation and understanding. We systematically analyze the trade-off between quantizer count, audio quality, and linguistic performance, and introduce an automatic evaluation of spoken content that uses large language models as judges (LLM-as-a-Judge). Experiments demonstrate state-of-the-art acoustic consistency and strong speaker identity preservation, along with good speech fidelity and semantic accuracy, improving naturalness and coherence in long-form speech synthesis.

📝 Abstract
We propose Llama-Mimi, a speech language model that uses a unified tokenizer and a single Transformer decoder to jointly model sequences of interleaved semantic and acoustic tokens. Comprehensive evaluation shows that Llama-Mimi achieves state-of-the-art performance in acoustic consistency and possesses the ability to preserve speaker identity. Our analysis further demonstrates that increasing the number of quantizers improves acoustic fidelity but degrades linguistic performance, highlighting the inherent challenge of maintaining long-term coherence. We additionally introduce an LLM-as-a-Judge-based evaluation to assess the spoken content quality of generated outputs. Our models, code, and speech samples are publicly available.
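The interleaving idea from the abstract can be illustrated with a minimal sketch. The assumptions here are not taken from the paper: a Mimi-style residual quantizer with a per-quantizer codebook size of 2048, where the first quantizer's tokens are semantic and the rest acoustic, and per-quantizer id offsets as one illustrative way to fold all streams into a single decoder vocabulary.

```python
# Minimal sketch of interleaving per-frame codec tokens into one flat
# sequence for a single Transformer decoder. Assumptions (not from the
# paper): codebook size 2048 per quantizer; quantizer 0 is "semantic",
# the rest "acoustic"; offsets keep the per-quantizer id ranges disjoint.

CODEBOOK_SIZE = 2048  # assumed per-quantizer codebook size

def interleave(codes: list[list[int]]) -> list[int]:
    """Flatten frame-major codes [T][Q] into one token stream.

    Quantizer q gets the id range [q*K, (q+1)*K), so one decoder with a
    single softmax can model semantic and acoustic tokens jointly.
    """
    seq = []
    for frame in codes:                   # time step t
        for q, code in enumerate(frame):  # quantizer q within the frame
            seq.append(q * CODEBOOK_SIZE + code)
    return seq

def deinterleave(seq: list[int], num_quantizers: int) -> list[list[int]]:
    """Inverse of interleave: recover [T][Q] codes from the flat stream."""
    assert len(seq) % num_quantizers == 0
    frames = []
    for t in range(0, len(seq), num_quantizers):
        frames.append([tok % CODEBOOK_SIZE for tok in seq[t:t + num_quantizers]])
    return frames

codes = [[5, 17, 900], [3, 8, 12]]        # T=2 frames, Q=3 quantizers
flat = interleave(codes)
assert deinterleave(flat, 3) == codes
```

The abstract's quantizer trade-off is visible in this layout: each extra quantizer lengthens the flat sequence per audio frame, which pushes linguistically related tokens further apart in the context window.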
Problem

Research questions and friction points this paper is trying to address.

Modeling interleaved semantic and acoustic tokens
Balancing acoustic fidelity with linguistic performance
Evaluating spoken content quality in generated outputs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified tokenizer for semantic and acoustic tokens
Single Transformer decoder for joint modeling
LLM-as-a-Judge evaluation for content quality
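The LLM-as-a-Judge idea above can be sketched as follows. This is not the paper's prompt or rubric: the 1-5 coherence rubric, the `build_judge_prompt`/`parse_score` helpers, and the `call_llm` stub are all hypothetical; in practice the generated speech would first be transcribed by an ASR system and `call_llm` replaced by a real chat-model API.

```python
# Hedged sketch of an LLM-as-a-Judge check on generated speech content.
# The rubric and all function names are illustrative, not from the paper.

def build_judge_prompt(transcript: str) -> str:
    """Assemble a simple 1-5 rubric prompt for the judge model."""
    return (
        "Rate the following transcript of generated speech for coherence, "
        "grammaticality, and meaningfulness on a scale of 1-5. "
        "Reply with the number only.\n\n"
        f"Transcript: {transcript}"
    )

def parse_score(reply: str) -> int:
    """Extract the first digit of the reply, clamped to the 1-5 range."""
    digits = "".join(ch for ch in reply if ch.isdigit())
    score = int(digits[:1]) if digits else 1
    return max(1, min(5, score))

def call_llm(prompt: str) -> str:
    """Stub standing in for a real chat-completion call."""
    return "4"

score = parse_score(call_llm(build_judge_prompt("The weather is nice today.")))
assert 1 <= score <= 5
```

Scoring the transcript rather than the waveform lets a text-only judge model assess spoken content quality independently of acoustic fidelity.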