🤖 AI Summary
In federated learning (FL) for voice cloning, high communication overhead and suppression of stylistic heterogeneity impede personalized expression. To address this, we propose a privacy-preserving disentangled federated voice cloning framework. Our method introduces: (1) a dual-path disentanglement architecture that explicitly separates speaker identity from stylistic attributes (e.g., emotion and prosody); (2) lightweight low-rank adaptation (LoRA) for efficient client-side style customization; and (3) a collaborative-filtering-based personalized model aggregation mechanism that explicitly models cross-client stylistic distribution heterogeneity. Experiments demonstrate that our approach significantly improves speech naturalness, stylistic expressiveness, and speaker similarity while preserving strong privacy guarantees. Moreover, it reduces communication overhead by 42.6% compared to standard FedAvg.
📝 Abstract
Voice cloning for Text-to-Speech (TTS) aims to generate expressive and personalized speech from text using limited data from a target speaker. Federated Learning (FL) offers a collaborative and privacy-preserving framework for this task, but existing approaches suffer from high communication costs and tend to suppress stylistic heterogeneity, resulting in insufficient personalization. To address these issues, we propose Fed-PISA, which stands for Federated Personalized Identity-Style Adaptation. To minimize communication costs, Fed-PISA introduces a disentangled Low-Rank Adaptation (LoRA) mechanism: the speaker's timbre is retained locally through a private ID-LoRA, while only a lightweight style-LoRA is transmitted to the server, thereby minimizing parameter exchange. To harness heterogeneity, our aggregation method, inspired by collaborative filtering, is introduced to create custom models for each client by learning from stylistically similar peers. Experiments show that Fed-PISA improves style expressivity, naturalness, and speaker similarity, outperforming standard federated baselines with minimal communication costs.