Visatronic: A Multimodal Decoder-Only Model for Speech Synthesis

📅 2024-11-26
🏛️ arXiv.org
📈 Citations: 1
✨ Influential: 0
🤖 AI Summary
Existing LLM-based multimodal models struggle to model cross-modal temporal dependencies, hindering precise audio-visual-textual alignment in video-to-text-to-speech (VTTS) generation. To address this, we propose the first end-to-end aligned, decoder-only multimodal architecture that jointly processes text, video frames, and speech token sequences. Our method introduces time-aligned embeddings and a dynamic cross-modal token mixing mechanism to explicitly model temporal correspondences across modalities. We further propose TimeSync, a novel phoneme-level metric, to quantitatively evaluate speech-video synchronization. Trained on VoxCeleb2 and zero-shot transferred to LRS3, our model achieves a word error rate (WER) of 4.5%, substantially outperforming the state-of-the-art trained solely on LRS3 (21.4%). Moreover, it demonstrates significantly improved speech-video temporal consistency, validating both strong cross-dataset generalization and the efficacy of joint multimodal temporal modeling.
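The decoder-only design described above can be pictured as a single transformer consuming one concatenated token stream. A minimal sketch follows; all dimensions, vocabulary sizes, and layer counts are illustrative assumptions, not the paper's actual configuration:

```python
import torch
import torch.nn as nn

class MultimodalDecoder(nn.Module):
    """Toy decoder-only model: text, video, and speech are embedded into one
    shared space and processed as a single causal token stream.
    All sizes below are illustrative assumptions."""
    def __init__(self, d_model=256, text_vocab=1000, speech_vocab=1024, video_dim=512):
        super().__init__()
        self.text_emb = nn.Embedding(text_vocab, d_model)
        self.speech_emb = nn.Embedding(speech_vocab, d_model)
        self.video_proj = nn.Linear(video_dim, d_model)  # project frame features
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, speech_vocab)  # next-speech-token logits

    def forward(self, text_ids, video_feats, speech_ids):
        # One stream: [text | video | speech]; a causal mask makes it decoder-only.
        x = torch.cat([
            self.text_emb(text_ids),
            self.video_proj(video_feats),
            self.speech_emb(speech_ids),
        ], dim=1)
        T = x.size(1)
        causal = torch.triu(torch.full((T, T), float("-inf")), diagonal=1)
        h = self.blocks(x, mask=causal)
        return self.head(h)

model = MultimodalDecoder()
logits = model(
    torch.randint(0, 1000, (1, 8)),   # 8 text tokens
    torch.randn(1, 12, 512),          # 12 video frame features
    torch.randint(0, 1024, (1, 20)),  # 20 speech tokens
)
print(logits.shape)  # one logit vector per position: 8 + 12 + 20 = 40
```

At generation time only the positions after the conditioning prefix would be sampled as speech tokens; the causal mask lets them attend back to the text and video context.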

๐Ÿ“ Abstract
The rapid progress of foundation models and large language models (LLMs) has fueled significant improvements in the capabilities of machine learning systems that benefit from multimodal input data. However, existing multimodal models are predominantly built on top of pre-trained LLMs, which can limit accurate modeling of temporal dependencies across other modalities and thus limit the model's ability to jointly process and leverage multimodal inputs. To specifically investigate the alignment of text, video, and speech modalities in LLM-style (decoder-only) models, we consider a simplified multimodal generation task, Video-Text to Speech (VTTS): speech generation conditioned on both its corresponding text and video of talking people. The ultimate goal is to generate speech that not only follows the text but also aligns temporally with the video and is consistent with the facial expressions. In this paper, we first introduce Visatronic, a unified multimodal decoder-only transformer model that adopts an LLM-style architecture to embed visual, textual, and speech inputs into a shared subspace, treating all modalities as temporally aligned token streams. Next, we carefully explore different token mixing strategies to understand the best way to propagate information from the steps where video and text conditioning is input to the steps where the audio is generated. We extensively evaluate Visatronic on the challenging VoxCeleb2 dataset and demonstrate zero-shot generalization to LRS3, where Visatronic, trained on VoxCeleb2, achieves a 4.5% WER, outperforming prior SOTA methods trained only on LRS3, which report a 21.4% WER. Additionally, we propose a new objective metric, TimeSync, specifically designed to measure phoneme-level temporal alignment between generated and reference speech, further ensuring synchronization quality. Demo: https://apple.github.io/visatronic-demo/
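The abstract describes TimeSync only as a phoneme-level temporal alignment metric; its exact definition is not given here. A plausible sketch, purely illustrative, scores alignment as the mean absolute offset between matched phoneme start times (as produced by a forced aligner) in generated versus reference speech; lower is better:

```python
def timesync_score(ref_phonemes, gen_phonemes):
    """Mean absolute start-time offset (seconds) over phonemes present in both
    alignments. Illustrative assumption only: the paper's TimeSync metric may
    be defined differently. Each argument is a list of (phoneme, start_time)
    pairs; this toy version assumes each phoneme label appears at most once."""
    ref = dict(ref_phonemes)
    offsets = [abs(t - ref[p]) for p, t in gen_phonemes if p in ref]
    if not offsets:
        raise ValueError("no overlapping phonemes to compare")
    return sum(offsets) / len(offsets)

# Toy alignments for the word "hello" (timestamps are made up).
ref = [("HH", 0.00), ("AH", 0.08), ("L", 0.15), ("OW", 0.22)]
gen = [("HH", 0.02), ("AH", 0.09), ("L", 0.18), ("OW", 0.21)]
print(timesync_score(ref, gen))  # small value = good synchronization
```

A perfectly aligned output would score 0.0 against its reference.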
Problem

Research questions and friction points this paper is trying to address.

Aligning text, video, and speech in decoder-only models
Generating speech synchronized with video and text
Improving multimodal temporal dependency modeling in LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decoder-only transformer for multimodal synthesis
Shared subspace embedding for aligned modalities
TimeSync metric for phoneme-level alignment
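The token mixing strategies mentioned above concern how conditioning tokens are ordered relative to the speech tokens being generated. A toy illustration of two orderings, blocked (all conditioning first) versus interleaved (video frames mixed in near the speech tokens they correspond to); the specific ratios and stream contents are made-up examples, not the paper's schemes:

```python
def blocked_order(text, video, speech):
    """All conditioning first, then speech: [text | video | speech]."""
    return list(text) + list(video) + list(speech)

def interleaved_order(text, video, speech, frames_per_step=1, tokens_per_step=2):
    """Text first, then video frames interleaved with speech tokens so each
    speech token sits near its temporally corresponding frames (toy scheme)."""
    out = list(text)
    v, s = list(video), list(speech)
    while v or s:
        out += v[:frames_per_step]; v = v[frames_per_step:]
        out += s[:tokens_per_step]; s = s[tokens_per_step:]
    return out

text = ["t0", "t1"]
video = ["v0", "v1", "v2"]
speech = ["s0", "s1", "s2", "s3", "s4", "s5"]
print(blocked_order(text, video, speech))
print(interleaved_order(text, video, speech))
```

Under a causal mask, the ordering determines which conditioning each speech position can attend to, which is why the choice of mixing strategy matters for temporal alignment.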