🤖 AI Summary
This work addresses the challenge of unifying speech analysis, controllable synthesis, and generative modeling within a single framework. We propose AnCoGen, a unified architecture based on masked autoencoding that jointly handles speech analysis (estimation of speaker identity, pitch, linguistic content, loudness, signal-to-noise ratio, and clarity index), attribute-based control, and high-fidelity speech generation. By training jointly across tasks and conditioning generation on explicit attributes, AnCoGen enables fine-grained, multi-dimensional editing of speech attributes. Extensive experiments demonstrate strong performance on speech analysis-resynthesis, pitch estimation and modification, and speech enhancement, validating its cross-task generalization and precise controllability. AnCoGen thus bridges traditionally disjoint objectives in speech processing, offering a single model for both speech representation learning and controllable generation.
📝 Abstract
This article introduces AnCoGen, a novel method that leverages a masked autoencoder to unify the analysis, control, and generation of speech signals within a single model. AnCoGen can analyze speech by estimating key attributes, such as speaker identity, pitch, content, loudness, signal-to-noise ratio, and clarity index. In addition, it can generate speech from these attributes and enables precise control of the synthesized speech by modifying them. Extensive experiments demonstrate the effectiveness of AnCoGen across speech analysis-resynthesis, pitch estimation, pitch modification, and speech enhancement.
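To make the masked-autoencoder idea concrete, the sketch below illustrates the masking protocol implied by the abstract: the same model serves analysis, generation, or resynthesis/control depending on which side (attributes or speech) is masked, and control amounts to editing an estimated attribute before regenerating. This is a hypothetical toy illustration, not the paper's implementation; all names (`SpeechAttrs`, `task_for`) are invented, and the real model is a trained neural network rather than a dispatch function.

```python
from dataclasses import dataclass, replace
from typing import Optional

# Hypothetical attribute container; the paper's attributes include speaker
# identity, pitch, content, loudness, SNR, and clarity index.
@dataclass
class SpeechAttrs:
    speaker: Optional[str] = None
    pitch_hz: Optional[float] = None
    content: Optional[str] = None
    loudness_db: Optional[float] = None

def task_for(attrs_masked: bool, speech_masked: bool) -> str:
    # One model, three behaviors, selected purely by the masking pattern:
    if attrs_masked and not speech_masked:
        return "analysis"     # speech given -> predict its attributes
    if speech_masked and not attrs_masked:
        return "generation"   # attributes given -> predict the speech
    return "resynthesis"      # both given -> reconstruct the signal

# Analysis: mask the attributes, keep the speech.
print(task_for(attrs_masked=True, speech_masked=False))   # -> analysis

# Control: take analyzed attributes, edit one, then regenerate with the
# speech side masked so the model synthesizes from the edited attributes.
estimated = SpeechAttrs(speaker="spk1", pitch_hz=180.0,
                        content="hello", loudness_db=-20.0)
edited = replace(estimated, pitch_hz=220.0)  # raise pitch, keep the rest
print(task_for(attrs_masked=False, speech_masked=True))   # -> generation
print(edited.pitch_hz)                                    # -> 220.0
```

The key design point this sketch captures is that no task-specific heads are needed: analysis, generation, and controlled resynthesis are all instances of filling in masked slots.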