UniSpeaker: A Unified Approach for Multimodality-driven Speaker Generation

πŸ“… 2025-01-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This paper addresses multimodality-driven personalized speech generation by proposing a unified cross-modal voice control framework. Methodologically, it introduces a voice aggregator based on KV-Former that fuses heterogeneous voice descriptions (text, images, and audio), and applies a soft contrastive loss to map these modalities into a shared voice space, so that the generated voice aligns closely with the input description. Key contributions include: (1) establishing MVC, the first multimodality-based voice control benchmark, which focuses on voice suitability, voice diversity, and speech quality; (2) outperforming previous modality-specific models across the five tasks of the MVC benchmark; and (3) publicly releasing speech samples. This work offers a unified paradigm for multimodality-driven speaker generation and a standardized evaluation foundation for future research.
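The paper's exact loss formulation is not given here, so the following is only an illustrative sketch of a generic soft contrastive objective of the kind the summary describes: instead of hard 0/1 targets, each voice-description embedding is pulled toward voice embeddings in proportion to soft similarity targets. All names (`soft_contrastive_loss`, the toy embeddings) are hypothetical, not from the paper.

```python
import math

def cosine(u, v):
    # Cosine similarity between two vectors.
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def soft_contrastive_loss(desc_emb, voice_emb, soft_targets, temperature=0.1):
    """Cross-entropy between a softmax over description-to-voice
    similarities and soft (non-one-hot) target distributions.

    desc_emb, voice_emb: lists of embedding vectors (one pair per item).
    soft_targets[i][j]:  target probability that description i matches voice j.
    """
    n = len(desc_emb)
    total = 0.0
    for i in range(n):
        # Temperature-scaled similarity logits against every voice embedding.
        logits = [cosine(desc_emb[i], voice_emb[j]) / temperature for j in range(n)]
        # Numerically stable softmax.
        m = max(logits)
        exps = [math.exp(x - m) for x in logits]
        z = sum(exps)
        log_probs = [math.log(e / z) for e in exps]
        # Cross-entropy against the soft target row.
        total -= sum(t * lp for t, lp in zip(soft_targets[i], log_probs))
    return total / n
```

In this toy form, aligned description/voice pairs with near-one-hot targets yield a lower loss than mismatched pairs, which is the behavior a shared voice space is trained toward; a real implementation would operate on batched tensors from the multimodal encoders.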

πŸ“ Abstract
Recent advancements in personalized speech generation have brought synthetic speech increasingly close to the realism of target speakers' recordings, yet multimodality-driven speaker generation is still an emerging area. This paper introduces UniSpeaker, a unified approach for multimodality-driven speaker generation. Specifically, we propose a unified voice aggregator based on KV-Former, applying a soft contrastive loss to map diverse voice description modalities into a shared voice space, ensuring that the generated voice aligns more closely with the input descriptions. To evaluate multimodality-driven voice control, we build the first multimodality-based voice control (MVC) benchmark, focusing on voice suitability, voice diversity, and speech quality. UniSpeaker is evaluated across five tasks using the MVC benchmark, and the experimental results demonstrate that UniSpeaker outperforms previous modality-specific models. Speech samples are available at https://UniSpeaker.github.io.
Problem

Research questions and friction points this paper is trying to address.

Multimodal Information Integration
Natural Speech Synthesis
Personalized Voice Generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

UniSpeaker
Multimodal Integration
Speech Synthesis
πŸ”Ž Similar Papers
No similar papers found.