AI Summary
This paper addresses multimodality-driven personalized speech synthesis by proposing the first unified cross-modal voice control framework. Methodologically, it introduces a KV-Former-based voice aggregator that fuses heterogeneous voice descriptions (text, images, and audio) and applies a soft contrastive loss to align these modalities in a shared voice space, enabling end-to-end controllable speech synthesis. Key contributions include: (1) establishing MVC, the first multimodality-based voice control benchmark, covering five speaker generation tasks; (2) outperforming previous modality-specific models across MVC, with consistent gains in voice suitability, voice diversity, and speech quality; and (3) releasing the model together with perceptually validated speech samples. This work establishes a new paradigm for multimodal speaker generation and provides a standardized evaluation foundation for future research.
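To make the aggregator concrete, below is a minimal PyTorch sketch of a KV-Former-style voice aggregator. It assumes learnable query tokens that cross-attend to modality embeddings injected as keys/values and pools them into a single voice embedding; the layer layout, dimensions, and pooling are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class KVFormerBlock(nn.Module):
    """One aggregator block: learnable queries cross-attend to modality
    features injected as keys/values (a sketch, not the paper's exact design)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        # kv: embeddings from any one modality encoder (text / image / audio)
        attn_out, _ = self.cross_attn(queries, kv, kv)
        x = self.norm1(queries + attn_out)
        return self.norm2(x + self.ffn(x))

class VoiceAggregator(nn.Module):
    """Maps heterogeneous voice descriptions into a shared voice space."""
    def __init__(self, dim: int = 256, num_queries: int = 4, depth: int = 2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.blocks = nn.ModuleList(KVFormerBlock(dim) for _ in range(depth))

    def forward(self, modality_feats: torch.Tensor) -> torch.Tensor:
        # modality_feats: (batch, seq_len, dim) features from a modality encoder
        q = self.queries.unsqueeze(0).expand(modality_feats.size(0), -1, -1)
        for block in self.blocks:
            q = block(q, modality_feats)
        return q.mean(dim=1)  # one shared-space voice embedding per input
```

Because the queries are shared while only the keys/values change per modality, any description type can be projected into the same voice space without modality-specific heads; this is one plausible reading of the design, stated here as an assumption.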
Abstract
Recent advancements in personalized speech generation have brought synthetic speech increasingly close to the realism of target speakers' recordings, yet multimodal speaker generation is still in its early stages. This paper introduces UniSpeaker, a unified approach for multimodality-driven speaker generation. Specifically, we propose a unified voice aggregator based on KV-Former and apply a soft contrastive loss to map diverse voice description modalities into a shared voice space, ensuring that the generated voice aligns closely with the input descriptions. To evaluate multimodality-driven voice control, we build the first multimodality-based voice control (MVC) benchmark, which focuses on voice suitability, voice diversity, and speech quality. UniSpeaker is evaluated across five tasks on the MVC benchmark, and the experimental results demonstrate that it outperforms previous modality-specific models. Speech samples are available at https://UniSpeaker.github.io.
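For intuition on the alignment objective, here is a hedged sketch of a soft contrastive loss: a soft-label InfoNCE over description/voice similarity logits, where a row-stochastic `soft_targets` matrix replaces the one-hot targets of standard contrastive learning. The paper's exact formulation may differ; the temperature `tau` and the construction of the soft targets (e.g., from speaker-embedding similarity within the batch) are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(desc_emb: torch.Tensor,
                          voice_emb: torch.Tensor,
                          soft_targets: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """Soft InfoNCE-style alignment (illustrative, not the paper's exact loss).

    desc_emb:     (B, D) embeddings of the voice descriptions
    voice_emb:    (B, D) embeddings of the target voices
    soft_targets: (B, B) row-stochastic cross-sample similarity matrix
    """
    desc = F.normalize(desc_emb, dim=-1)
    voice = F.normalize(voice_emb, dim=-1)
    logits = desc @ voice.t() / tau            # (B, B) similarity logits
    log_probs = F.log_softmax(logits, dim=-1)  # distribution over voices per description
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```

Relative to a hard contrastive loss, the soft targets avoid penalizing descriptions that legitimately match several similar voices, which fits the one-to-many nature of voice descriptions.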