🤖 AI Summary
To address the deep entanglement among content, timbre, and emotion in speech synthesis, as well as the coarse granularity of existing control mechanisms, this paper proposes MF-Speech, a dual-module framework consisting of MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder disentangles speech into highly pure, independent representations of the three factors via multi-objective optimization, and MF-SpeechGenerator achieves fine-grained, composable control over them through dynamic fusion and Hierarchical Style-Adaptive Normalization (HSAN); the learned discrete factors also transfer across tasks. On multi-factor compositional generation, MF-Speech significantly outperforms prior methods, with WER = 4.67%, SECS = 0.5685, Corr = 0.68, nMOS = 3.96, and the highest subjective scores for emotion and style control. This work points toward a general-purpose speech representation and a new approach to controllable speech synthesis.
📝 Abstract
Generating expressive and controllable human speech is one of the core goals of generative artificial intelligence, but progress has long been constrained by two fundamental challenges: the deep entanglement of speech factors and the coarse granularity of existing control mechanisms. To overcome these challenges, we propose a novel framework, MF-Speech, which consists of two core components: MF-SpeechEncoder and MF-SpeechGenerator. MF-SpeechEncoder acts as a factor purifier, adopting a multi-objective optimization strategy to decompose the original speech signal into highly pure and independent representations of content, timbre, and emotion. MF-SpeechGenerator then functions as a conductor, achieving precise, composable, and fine-grained control over these factors through dynamic fusion and Hierarchical Style-Adaptive Normalization (HSAN). Experiments demonstrate that on the highly challenging multi-factor compositional speech generation task, MF-Speech significantly outperforms current state-of-the-art methods, achieving a lower word error rate (WER = 4.67%), superior style control (SECS = 0.5685, Corr = 0.68), and the highest subjective evaluation scores (nMOS = 3.96, sMOS_emotion = 3.86, sMOS_style = 3.78). Furthermore, the learned discrete factors exhibit strong transferability, demonstrating their significant potential as a general-purpose speech representation.
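To make the generator-side conditioning concrete, below is a minimal PyTorch-style sketch of style-adaptive normalization applied hierarchically: a fused style vector (e.g., combined timbre and emotion factors) predicts per-channel scale and shift that modulate the content pathway at every decoder layer. The class names, layer sizes, and exact fusion scheme are illustrative assumptions, not the paper's actual HSAN implementation.

```python
import torch
import torch.nn as nn


class StyleAdaptiveNorm(nn.Module):
    """Illustrative style-adaptive normalization: a style embedding predicts
    per-channel scale (gamma) and shift (beta) applied after LayerNorm."""

    def __init__(self, hidden_dim: int, style_dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim, elementwise_affine=False)
        self.affine = nn.Linear(style_dim, 2 * hidden_dim)

    def forward(self, x: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, hidden_dim); style: (batch, style_dim)
        gamma, beta = self.affine(style).chunk(2, dim=-1)
        return self.norm(x) * (1 + gamma.unsqueeze(1)) + beta.unsqueeze(1)


class HierarchicalStyleDecoder(nn.Module):
    """Stack of decoder layers, each conditioned on the fused style vector
    via StyleAdaptiveNorm, i.e., style is injected at every hierarchy level."""

    def __init__(self, hidden_dim: int = 256, style_dim: int = 128, num_layers: int = 4):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.Linear(hidden_dim, hidden_dim) for _ in range(num_layers)
        )
        self.norms = nn.ModuleList(
            StyleAdaptiveNorm(hidden_dim, style_dim) for _ in range(num_layers)
        )

    def forward(self, content: torch.Tensor, style: torch.Tensor) -> torch.Tensor:
        # content: frame-level content representation, (batch, time, hidden_dim)
        # style: fused timbre/emotion embedding, (batch, style_dim)
        h = content
        for layer, norm in zip(self.layers, self.norms):
            h = norm(torch.relu(layer(h)), style)
        return h


# Usage sketch: condition a content sequence on a fused style vector.
decoder = HierarchicalStyleDecoder()
content = torch.randn(2, 100, 256)   # hypothetical content factor sequence
style = torch.randn(2, 128)          # hypothetical fused timbre+emotion factor
out = decoder(content, style)        # (2, 100, 256)
```

Injecting the style factors at multiple depths, rather than only at the input, is what allows separately swapping timbre or emotion while leaving the content pathway untouched, which is the composability property the abstract emphasizes.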