UniVoice: Unifying Autoregressive ASR and Flow-Matching based TTS with Large Language Models

📅 2025-10-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current ASR and TTS systems are typically modeled separately; while discrete speech tokenization enables joint modeling, it suffers from information loss that limits performance. This paper introduces the first unified large language model framework based on continuous speech representations for end-to-end joint ASR and TTS modeling. The approach addresses the fundamental trade-off between recognition and generation through two core innovations: (1) a dual-attention mechanism that dynamically switches between causal and bidirectional masking to jointly optimize autoregressive speech recognition and flow-matching-based speech synthesis; and (2) a text-prefixed speech completion strategy that enables high-fidelity zero-shot voice cloning. Experiments show that the framework matches or surpasses state-of-the-art single-task models on both ASR and zero-shot text-to-speech benchmarks, providing the first empirical evidence that a unified model built on continuous speech representations is effective for both speech understanding and generation.

📝 Abstract
Large language models (LLMs) have demonstrated promising performance in both automatic speech recognition (ASR) and text-to-speech (TTS) systems, gradually becoming the mainstream approach. However, most current approaches address these tasks separately rather than through a unified framework. This work aims to integrate the two tasks into one unified model. Although discrete speech tokenization enables joint modeling, its inherent information loss limits performance in both recognition and generation. In this work, we present UniVoice, a unified LLM framework based on continuous representations that seamlessly integrates speech recognition and synthesis within a single model. Our approach combines the strengths of autoregressive modeling for speech recognition with flow matching for high-quality generation. To mitigate the inherent divergence between autoregressive and flow-matching models, we further design a dual attention mechanism that switches between a causal mask for recognition and a bidirectional attention mask for synthesis. Furthermore, the proposed text-prefix-conditioned speech infilling method enables high-fidelity zero-shot voice cloning. Experimental results demonstrate that our method matches or exceeds current single-task modeling methods on both ASR and zero-shot TTS tasks. This work explores new possibilities for end-to-end speech understanding and generation.
Problem

Research questions and friction points this paper is trying to address.

Unifying speech recognition and synthesis in one model
Overcoming information loss from discrete speech tokenization
Integrating autoregressive and flow-matching models for speech tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified LLM framework with continuous speech representations
Combines autoregressive ASR and flow-matching TTS
Dual attention mechanism switches between recognition and synthesis
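The mask-switching idea behind the dual attention mechanism can be illustrated with a minimal sketch. This is a hypothetical reconstruction, not the paper's implementation: `dual_attention_mask` and its `mode` argument are assumed names. The ASR branch uses a causal (lower-triangular) mask so each position attends only to earlier positions, while the TTS branch uses a full bidirectional mask as flow-matching generation requires.

```python
import numpy as np

def dual_attention_mask(seq_len: int, mode: str) -> np.ndarray:
    """Sketch of UniVoice-style mask switching (hypothetical API).

    mode="asr": causal mask for autoregressive recognition.
    mode="tts": bidirectional mask for flow-matching synthesis.
    Returns a boolean (seq_len, seq_len) matrix; True = "may attend".
    """
    if mode == "asr":
        # Lower-triangular: position i attends only to positions <= i.
        return np.tril(np.ones((seq_len, seq_len), dtype=bool))
    if mode == "tts":
        # Full attention: every position sees the whole sequence.
        return np.ones((seq_len, seq_len), dtype=bool)
    raise ValueError(f"unknown mode: {mode!r}")
```

In a shared-backbone transformer, the same weights would serve both tasks, with only this mask (and the task's loss, cross-entropy versus flow matching) switched per batch.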
Authors

Wenhao Guan (Xiamen University)
Zhikang Niu (Shanghai Jiao Tong University)
Ziyue Jiang (Zhejiang University)
Kaidi Wang (Xiamen University)
Peijie Chen (Xiamen University)
Qingyang Hong (Xiamen University)
Lin Li (Xiamen University)
Xie Chen (Shanghai Innovation Institute, Shanghai Jiao Tong University)