🤖 AI Summary
Existing speech tokenizers struggle to effectively disentangle semantic content from acoustic style, limiting the controllable generation capabilities of large speech models. This work proposes a novel semantic–acoustic disentangled tokenization paradigm: discrete semantic tokens are learned under ASR supervision, while acoustic tokens are modeled via mel-spectrogram reconstruction guidance. A hierarchical flow-matching decoder is further introduced to enable high-fidelity speech synthesis. By jointly optimizing reconstruction and recombination objectives, the method significantly enhances both speech fidelity and the flexibility of combining semantic and acoustic representations. Experimental results validate the critical role of disentangled representations in advancing controllable and high-quality speech generation.
📝 Abstract
Speech tokenizers serve as the cornerstone of discrete Speech Large Language Models (Speech LLMs). Existing tokenizers either prioritize semantic encoding, fuse semantic content with acoustic style inseparably, or achieve only incomplete semantic-acoustic disentanglement. To achieve better disentanglement, we propose DSA-Tokenizer, which explicitly disentangles speech into discrete semantic and acoustic tokens via distinct optimization constraints. Specifically, semantic tokens are supervised by ASR to capture linguistic content, while acoustic tokens focus on mel-spectrogram reconstruction to encode style. To eliminate rigid length constraints between the two sequences, we introduce a hierarchical Flow-Matching decoder that further improves speech generation quality. Furthermore, we employ a joint reconstruction-recombination training strategy to enforce this separation. DSA-Tokenizer enables high-fidelity reconstruction and flexible recombination through robust disentanglement, facilitating controllable generation in speech LLMs. Our analysis highlights disentangled tokenization as a pivotal paradigm for future speech modeling. Audio samples are available at https://anonymous.4open.science/w/DSA_Tokenizer_demo/. The code and model will be made publicly available after the paper has been accepted.
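To make the training signal concrete, here is a minimal, purely illustrative sketch of the joint reconstruction-recombination objective described above: semantic tokens are scored against a transcript (a stand-in for the ASR cross-entropy supervision), acoustic tokens drive a mel-spectrogram reconstruction term, and a recombination term re-checks that content survives an acoustic-stream swap. All function names, variables, and weights here are hypothetical, not taken from the paper's implementation.

```python
def semantic_loss(sem_tokens, transcript_ids):
    # Stand-in for ASR cross-entropy: fraction of mismatched positions.
    return sum(s != t for s, t in zip(sem_tokens, transcript_ids)) / len(transcript_ids)

def acoustic_loss(pred_mel, target_mel):
    # Stand-in for mel-spectrogram reconstruction: mean absolute error.
    return sum(abs(p - t) for p, t in zip(pred_mel, target_mel)) / len(target_mel)

def joint_loss(sem, transcript, mel_pred, mel_true, sem_after_swap, w_swap=0.5):
    # Reconstruction: each token stream is trained under its own constraint.
    recon = semantic_loss(sem, transcript) + acoustic_loss(mel_pred, mel_true)
    # Recombination: after pairing these semantic tokens with another
    # utterance's acoustic tokens, the transcript should still be recoverable.
    recomb = semantic_loss(sem_after_swap, transcript)
    return recon + w_swap * recomb

# Toy example (all values illustrative):
loss = joint_loss(
    sem=[1, 2, 3, 4], transcript=[1, 2, 3, 4],   # content matches exactly
    mel_pred=[0.1, 0.2], mel_true=[0.1, 0.25],   # small mel reconstruction error
    sem_after_swap=[1, 2, 3, 9],                 # one content slip after the swap
)
```

The key design point the sketch mirrors is that the two token streams never share a loss: semantic tokens are penalized only through transcript recovery, acoustic tokens only through spectrogram reconstruction, while the recombination term is what forces the separation to hold when streams from different utterances are mixed.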