Fed-PISA: Federated Voice Cloning via Personalized Identity-Style Adaptation

📅 2025-09-19
🤖 AI Summary
In federated learning (FL) for voice cloning, high communication overhead and suppression of stylistic heterogeneity impede personalized expression. To address this, we propose a privacy-preserving disentangled federated voice cloning framework. Our method introduces: (1) a dual-path disentanglement architecture that explicitly separates speaker identity from stylistic attributes (e.g., emotion and prosody); (2) lightweight low-rank adaptation (LoRA) for efficient client-side style customization; and (3) a collaborative-filtering-based personalized model aggregation mechanism that explicitly models cross-client stylistic distribution heterogeneity. Experiments demonstrate that our approach significantly improves speech naturalness, stylistic expressiveness, and speaker similarity while preserving strong privacy guarantees. Moreover, it reduces communication overhead by 42.6% compared to standard FedAvg.

📝 Abstract
Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high communication costs and tend to suppress stylistic heterogeneity, resulting in insufficient personalization. To address these issues, we propose Fed-PISA (Federated Personalized Identity-Style Adaptation). To reduce communication costs, Fed-PISA introduces a disentangled Low-Rank Adaptation (LoRA) mechanism: the speaker's timbre is retained locally through a private ID-LoRA, while only a lightweight style-LoRA is transmitted to the server. To harness heterogeneity, an aggregation method inspired by collaborative filtering creates a custom model for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, outperforming standard federated baselines with minimal communication costs.
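The split between a private ID-LoRA and a transmitted style-LoRA can be illustrated with a minimal sketch. Everything named here (`make_lora`, `Client`, `upload`, the 64x64 layer, rank 4) is a hypothetical choice for illustration and is not taken from the paper; the sketch only shows that the upload payload contains the style adapter and nothing else.

```python
import random

def make_lora(d_out, d_in, rank, seed):
    """Build one LoRA adapter: the weight update is B @ A,
    with B of shape (d_out x rank) and A of shape (rank x d_in)."""
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]  # zero init, so the update starts at zero
    return {"A": A, "B": B}

def num_params(adapter):
    """Count the scalar parameters stored in an adapter."""
    return sum(len(row) for matrix in adapter.values() for row in matrix)

class Client:
    """Hypothetical client holding two adapters for one frozen layer."""

    def __init__(self, d_out=64, d_in=64, rank=4, seed=0):
        # Assumption: the identity (timbre) adapter stays on the device.
        self.id_lora = make_lora(d_out, d_in, rank, seed)
        # Assumption: the style adapter is the only part shared with the server.
        self.style_lora = make_lora(d_out, d_in, rank, seed + 1)

    def upload(self):
        """Return the payload sent to the server: the style adapter only."""
        return {"style_lora": self.style_lora}

client = Client()
payload = client.upload()
full_layer = 64 * 64                      # parameters in the frozen dense layer
sent = num_params(payload["style_lora"])  # parameters actually transmitted
print(sent, full_layer)
```

With these illustrative sizes the client sends 512 scalars for a layer that holds 4,096, and the identity adapter never appears in the payload.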
Problem

Research questions and friction points this paper is trying to address.

High communication costs in federated voice cloning
Suppression of speaker stylistic heterogeneity, which limits personalization
Insufficient style expressivity and speaker similarity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Disentangled LoRA mechanism for efficiency
Collaborative filtering aggregation for personalization
Lightweight style-LoRA transmission reducing costs
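One way to picture aggregation that learns from stylistically similar peers is a similarity-weighted average computed separately for each client. The sketch below is an assumption, not the paper's method: it uses cosine similarity between flattened style-LoRA vectors and a softmax with a hypothetical `temperature` parameter, and the function name `personalized_aggregate` is invented for illustration.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def personalized_aggregate(style_vectors, temperature=0.1):
    """Return one aggregated style vector per client.

    style_vectors maps a client name to its flattened style-LoRA parameters.
    Each client's result is a softmax-weighted average over all clients,
    weighted by similarity to that client (assumed weighting scheme).
    """
    names = list(style_vectors)
    result = {}
    for target in names:
        sims = [cosine(style_vectors[target], style_vectors[peer]) for peer in names]
        top = max(sims)  # subtract the max for numerical stability
        exps = [math.exp((s - top) / temperature) for s in sims]
        total = sum(exps)
        weights = [e / total for e in exps]
        dim = len(style_vectors[target])
        result[target] = [
            sum(w * style_vectors[peer][k] for w, peer in zip(weights, names))
            for k in range(dim)
        ]
    return result

# Toy example: clients "a" and "b" have similar styles, "c" is very different.
updates = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [-1.0, 0.0]}
personalized = personalized_aggregate(updates)
plain_average = [sum(v[k] for v in updates.values()) / len(updates) for k in range(2)]
```

In the toy example, the model returned to client "a" is dominated by "a" and "b", while a uniform FedAvg-style mean would pull all three clients toward the same blended vector.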
Qi Wang
State Key Laboratory of AI Safety, Institute of Computing Technology, Chinese Academy of Sciences; Peng Cheng Laboratory; University of Chinese Academy of Sciences
Shituo Ma
Institute of Information Engineering, Chinese Academy of Sciences; University of Chinese Academy of Sciences
Guoxin Yu
Peng Cheng Laboratory; University of Chinese Academy of Sciences
Hanyang Peng
Peng Cheng Laboratory
Deep Learning, Optimization
Yue Yu
Peng Cheng Laboratory; University of Chinese Academy of Sciences