AI Summary
This paper addresses multimodality-driven personalized speech synthesis by proposing the first unified cross-modal voice control framework. Methodologically, it introduces a KV-Former-based voice aggregator that fuses heterogeneous voice descriptions (text, images, and audio) and applies a soft contrastive loss to align these modalities in a shared voice space, enabling end-to-end controllable speech synthesis. Key contributions include: (1) establishing MVC, the first multimodality-based voice control benchmark, covering five speaker generation tasks; (2) outperforming previous modality-specific models across MVC, with consistent gains in voice suitability, voice diversity, and speech quality; and (3) releasing the model together with perceptually validated speech samples. This work establishes a new paradigm for multimodal speaker generation and provides a standardized evaluation foundation for future research.
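To make the aggregator concrete, below is a minimal PyTorch sketch of a KV-Former-style voice aggregator. It assumes learnable query tokens that cross-attend to modality embeddings injected as keys/values and pools them into a single voice embedding; the layer layout, dimensions, and pooling are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class KVFormerBlock(nn.Module):
    """One aggregator block: learnable queries cross-attend to modality
    features injected as keys/values (a sketch, not the paper's exact design)."""
    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm1 = nn.LayerNorm(dim)
        self.ffn = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.norm2 = nn.LayerNorm(dim)

    def forward(self, queries: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        # kv: embeddings from any one modality encoder (text / image / audio)
        attn_out, _ = self.cross_attn(queries, kv, kv)
        x = self.norm1(queries + attn_out)
        return self.norm2(x + self.ffn(x))

class VoiceAggregator(nn.Module):
    """Maps heterogeneous voice descriptions into a shared voice space."""
    def __init__(self, dim: int = 256, num_queries: int = 4, depth: int = 2):
        super().__init__()
        self.queries = nn.Parameter(torch.randn(num_queries, dim))
        self.blocks = nn.ModuleList(KVFormerBlock(dim) for _ in range(depth))

    def forward(self, modality_feats: torch.Tensor) -> torch.Tensor:
        # modality_feats: (batch, seq_len, dim) features from a modality encoder
        q = self.queries.unsqueeze(0).expand(modality_feats.size(0), -1, -1)
        for block in self.blocks:
            q = block(q, modality_feats)
        return q.mean(dim=1)  # one shared-space voice embedding per input
```

Because the queries are shared while only the keys/values change per modality, any description type can be projected into the same voice space without modality-specific heads; this is one plausible reading of the design, stated here as an assumption.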
Abstract
Recent advancements in personalized speech generation have brought synthetic speech increasingly close to the realism of target speakers' recordings, yet multimodal speaker generation is still in its early stages. This paper introduces UniSpeaker, a unified approach for multimodality-driven speaker generation. Specifically, we propose a unified voice aggregator based on KV-Former and apply a soft contrastive loss to map diverse voice description modalities into a shared voice space, ensuring that the generated voice aligns closely with the input descriptions. To evaluate multimodality-driven voice control, we build the first multimodality-based voice control (MVC) benchmark, which focuses on voice suitability, voice diversity, and speech quality. UniSpeaker is evaluated across five tasks on the MVC benchmark, and the experimental results demonstrate that it outperforms previous modality-specific models. Speech samples are available at https://UniSpeaker.github.io.
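For intuition on the alignment objective, here is a hedged sketch of a soft contrastive loss: a soft-label InfoNCE over description/voice similarity logits, where a row-stochastic `soft_targets` matrix replaces the one-hot targets of standard contrastive learning. The paper's exact formulation may differ; the temperature `tau` and the construction of the soft targets (e.g., from speaker-embedding similarity within the batch) are assumptions.

```python
import torch
import torch.nn.functional as F

def soft_contrastive_loss(desc_emb: torch.Tensor,
                          voice_emb: torch.Tensor,
                          soft_targets: torch.Tensor,
                          tau: float = 0.07) -> torch.Tensor:
    """Soft InfoNCE-style alignment (illustrative, not the paper's exact loss).

    desc_emb:     (B, D) embeddings of the voice descriptions
    voice_emb:    (B, D) embeddings of the target voices
    soft_targets: (B, B) row-stochastic cross-sample similarity matrix
    """
    desc = F.normalize(desc_emb, dim=-1)
    voice = F.normalize(voice_emb, dim=-1)
    logits = desc @ voice.t() / tau            # (B, B) similarity logits
    log_probs = F.log_softmax(logits, dim=-1)  # distribution over voices per description
    return -(soft_targets * log_probs).sum(dim=-1).mean()
```

Relative to a hard contrastive loss, the soft targets avoid penalizing descriptions that legitimately match several similar voices, which fits the one-to-many nature of voice descriptions.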