MF-Speech: Achieving Fine-Grained and Compositional Control in Speech Generation via Factor Disentanglement

📅 2025-11-15
🤖 AI Summary
To address the deep entanglement among content, timbre, and emotion in speech synthesis, together with the coarse granularity of existing control mechanisms, this paper proposes MF-Speech, a dual-module framework: MF-SpeechEncoder disentangles the speech signal into high-purity representations of the three factors via multi-objective optimization, while MF-SpeechGenerator achieves fine-grained, composable control over them through dynamic fusion and Hierarchical Style-Adaptive Normalization (HSAN). On multi-factor compositional generation tasks, MF-Speech significantly outperforms prior methods (WER = 4.67%, SECS = 0.5685, Corr = 0.68, nMOS = 3.96), with subjective scores for emotion and style control reaching state-of-the-art levels. The learned discrete factors also transfer across tasks, suggesting a path toward a general-purpose speech representation and controllable speech synthesis.

📝 Abstract
Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but its progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose a novel framework called MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. Subsequently, MF-SpeechGenerator functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style-Adaptive Normalization (HSAN). Experiments demonstrate that in the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER=4.67%), superior style control (SECS=0.5685, Corr=0.68), and the highest subjective evaluation scores (nMOS=3.96, sMOS_emotion=3.86, sMOS_style=3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.
Problem

Research questions and friction points this paper is trying to address.

Disentangling deeply entangled speech factors for better control
Overcoming coarse granularity in existing speech generation mechanisms
Achieving precise compositional control over content, timbre and emotion
Innovation

Methods, ideas, or system contributions that make the work stand out.

MF-SpeechEncoder purifies speech factors via multi-objective optimization
MF-SpeechGenerator enables control through dynamic fusion and HSAN
Framework achieves fine-grained compositional control in speech generation
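The style-adaptive normalization idea behind HSAN can be illustrated with a minimal sketch: normalize a layer's features, then re-scale and re-shift them with parameters derived from a style (timbre/emotion) embedding, in the spirit of AdaIN. This is an assumption-based illustration of the general technique, not the paper's implementation; the function and parameter names (`style_adaptive_norm`, `style_scale`, `style_shift`) are hypothetical.

```python
# AdaIN-style conditional normalization sketch (hypothetical, stdlib only);
# HSAN in the paper applies this idea hierarchically across layers.
import statistics

def style_adaptive_norm(features, style_scale, style_shift, eps=1e-5):
    """Normalize a feature vector to zero mean / unit variance, then
    modulate it with scale/shift parameters predicted from a style
    (timbre or emotion) embedding."""
    mean = statistics.fmean(features)
    var = statistics.pvariance(features)
    normed = [(x - mean) / (var + eps) ** 0.5 for x in features]
    return [style_scale * x + style_shift for x in normed]

# In a hierarchical scheme, each layer would use its own (scale, shift)
# pair, so coarse layers set global style and fine layers local detail.
feats = [0.2, 1.4, -0.6, 0.8]
styled = style_adaptive_norm(feats, style_scale=2.0, style_shift=0.5)
```

After modulation, the output's mean and standard deviation track the style-derived shift and scale, which is how the style factors steer the generator's activations without touching content.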
👥 Authors

Xinyue Yu
School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China.

Youqing Fang
School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China.

Pingyu Wu
University of Science and Technology of China.

Guoyang Ye
School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China.

Wenbo Zhou
School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China.

Weiming Zhang
School of Information Science and Technology, University of Science and Technology of China, Hefei 230026, China.

Song Xiao
Department of Electronic and Communication Engineering, Beijing Electronic Science and Technology Institute, Beijing 100070, China.