Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation

📅 2025-09-19

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

In federated learning (FL) for voice cloning, high communication overhead and suppression of stylistic heterogeneity impede personalized expression. To address this, we propose Fed-PISA, a privacy-preserving, disentangled federated voice cloning framework. Our method introduces: (1) a dual-path disentanglement architecture that explicitly separates speaker identity from stylistic attributes (e.g., emotion and prosody); (2) lightweight low-rank adaptation (LoRA) for efficient client-side style customization; and (3) a collaborative-filtering-based personalized model aggregation mechanism that explicitly models cross-client stylistic distribution heterogeneity. Experiments demonstrate that our approach significantly improves speech naturalness, stylistic expressiveness, and speaker similarity while preserving strong privacy guarantees. Moreover, it reduces communication overhead by 42.6% compared to standard FedAvg.

📝 Abstract

Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high communication costs and tend to suppress stylistic heterogeneity, resulting in insufficient personalization. To address these issues, we propose Fed-PISA (Federated Personalized Identity-Style Adaptation). To reduce communication costs, Fed-PISA introduces a disentangled Low-Rank Adaptation (LoRA) mechanism: the speaker's timbre is retained locally in a private ID-LoRA, and only a lightweight style-LoRA is transmitted to the server. To harness heterogeneity, Fed-PISA uses an aggregation method inspired by collaborative filtering that builds a custom model for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, outperforming standard federated baselines with minimal communication costs.
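The split between a private ID-LoRA and a transmitted style-LoRA can be illustrated with a small NumPy sketch. This is not the authors' implementation: the layer shape, rank, scaling factor `alpha`, the choice to add both adapters to the same frozen weight, and the helper names (`make_adapter`, `adapted_weight`, `upload_payload`) are all assumptions made for illustration. It only shows why sending one low-rank adapter is much cheaper than sending a full weight matrix.

```python
import numpy as np

def make_adapter(d_out, d_in, r, rng):
    # Common LoRA initialisation: A is small random, B is zero,
    # so the low-rank update starts at exactly zero.
    return {"A": rng.normal(0.0, 0.01, size=(r, d_in)),
            "B": np.zeros((d_out, r))}

def lora_delta(adapter, alpha):
    # Low-rank update (alpha / r) * B @ A, where r is the adapter rank.
    r = adapter["A"].shape[0]
    return (alpha / r) * (adapter["B"] @ adapter["A"])

def adapted_weight(W, id_lora, style_lora, alpha=8.0):
    # ASSUMPTION: both adapters are added to the same frozen base weight.
    return W + lora_delta(id_lora, alpha) + lora_delta(style_lora, alpha)

def upload_payload(client):
    # Only the style adapter leaves the device; the ID adapter stays local.
    return {name: value.copy() for name, value in client["style_lora"].items()}

rng = np.random.default_rng(0)
d_out, d_in, r = 256, 256, 4          # hypothetical layer size and rank
W = rng.normal(size=(d_out, d_in))    # frozen base weight (stand-in)

client = {
    "id_lora": make_adapter(d_out, d_in, r, rng),     # private, never sent
    "style_lora": make_adapter(d_out, d_in, r, rng),  # shared with the server
}

payload = upload_payload(client)
sent = sum(value.size for value in payload.values())
full = W.size
print(f"parameters sent: {sent}, full layer: {full}")
```

For this hypothetical 256 x 256 layer with rank 4, the style adapter has r * (d_in + d_out) = 2048 parameters, against 65536 for the full weight. The actual savings in Fed-PISA depend on the real model and adapter configuration, which the abstract does not specify.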

Problem

Research questions and friction points this paper is trying to address.

High communication costs in federated voice cloning

Suppression of speakers' stylistic heterogeneity, which limits personalization

Limited style expressivity and speaker similarity in existing federated baselines

Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled LoRA mechanism for efficiency

Collaborative filtering aggregation for personalization

Lightweight style-LoRA transmission reducing costs
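To make the collaborative-filtering idea concrete, the sketch below builds one aggregated style vector per client by weighting peers according to how similar their uploaded style-LoRA parameters are. The abstract does not state the similarity measure or weighting rule, so cosine similarity, the softmax weighting, the `temperature` parameter, and the function name `personalized_aggregate` are assumptions chosen for illustration, not the paper's method.

```python
import numpy as np

def personalized_aggregate(updates, temperature=0.1):
    """Similarity-weighted aggregation (illustrative assumption).

    updates: array of shape (n_clients, n_params), each row a flattened
             style-LoRA update from one client.
    Returns (aggregated, weights), where aggregated[i] is the custom
    style vector for client i and weights[i] are its peer weights.
    """
    # Cosine similarity between every pair of client updates.
    norms = np.linalg.norm(updates, axis=1, keepdims=True)
    unit = updates / np.maximum(norms, 1e-12)
    sim = unit @ unit.T

    # Row-wise softmax: stylistically similar peers get larger weights.
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    weights = np.exp(logits)
    weights = weights / weights.sum(axis=1, keepdims=True)

    # Each client receives its own weighted average of all updates.
    return weights @ updates, weights

# Toy example: clients 0 and 1 have similar styles, client 2 differs.
updates = np.array([[1.0, 0.0],
                    [0.9, 0.1],
                    [0.0, 1.0]])
aggregated, weights = personalized_aggregate(updates)
print(np.round(weights, 3))
```

In contrast, FedAvg would give every client the same uniformly (or data-size) weighted average, which is the behaviour the paper describes as suppressing stylistic heterogeneity. A lower `temperature` makes each client rely more heavily on its closest peers.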
