🤖 AI Summary
This work addresses the challenge of unifying speech analysis, controllable synthesis, and generative modeling within a single framework. We propose AnCoGen, a unified architecture based on masked autoencoding that jointly handles speech analysis (estimation of speaker identity, pitch, linguistic content, loudness, signal-to-noise ratio, and clarity index), attribute-based control, and high-fidelity speech generation. By training jointly across tasks and conditioning generation on explicit attributes, AnCoGen enables fine-grained, multi-dimensional editing of speech attributes. Extensive experiments demonstrate strong performance on speech analysis-resynthesis, pitch estimation and modification, and speech enhancement, validating its cross-task generalization and precise controllability. AnCoGen thus bridges traditionally disjoint objectives in speech processing, offering a single model for both speech representation learning and controllable generation.
📝 Abstract
This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes, such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and enables precise control of the synthesized speech by modifying them. Extensive experiments demonstrate the effectiveness of AnCoGen across speech analysis-resynthesis, pitch estimation, pitch modification, and speech enhancement.
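To make the masked-autoencoder idea concrete, the sketch below illustrates the masking protocol implied by the abstract: the same model serves analysis, generation, or resynthesis/control depending on which side (attributes or speech) is masked, and control amounts to editing an estimated attribute before regenerating. This is a hypothetical toy illustration, not the paper's implementation; all names (`SpeechAttrs`, `task_for`) are invented, and the real model is a trained neural network rather than a dispatch function.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical attribute container; the paper's attributes include speaker
# identity, pitch, content, loudness, SNR, and clarity index.
@dataclass
class SpeechAttrs:
    speaker: Optional[str] = None
    pitch_hz: Optional[float] = None
    content: Optional[str] = None
    loudness_db: Optional[float] = None

def task_for(attrs_masked: bool, speech_masked: bool) -> str:
    # One model, three behaviors, selected purely by the masking pattern:
    if attrs_masked and not speech_masked:
        return "analysis"     # speech given -> predict its attributes
    if speech_masked and not attrs_masked:
        return "generation"   # attributes given -> predict the speech
    return "resynthesis"      # both given -> reconstruct the signal

# Analysis: mask the attributes, keep the speech.
print(task_for(attrs_masked=True, speech_masked=False))   # -> analysis

# Control: take analyzed attributes, edit one, then regenerate with the
# speech side masked so the model synthesizes from the edited attributes.
estimated = SpeechAttrs(speaker="spk1", pitch_hz=180.0,
                        content="hello", loudness_db=-20.0)
edited = replace(estimated, pitch_hz=220.0)  # raise pitch, keep the rest
print(task_for(attrs_masked=False, speech_masked=True))   # -> generation
print(edited.pitch_hz)                                    # -> 220.0
```

The key design point this sketch captures is that no task-specific heads are needed: analysis, generation, and controlled resynthesis are all instances of filling in masked slots.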