🤖 AI Summary
Current open-source multimodal models suffer from a fundamental modality fragmentation: vision-language models (VLMs) lack end-to-end speech generation, while speech-language models (SLMs) lack visual understanding. This work introduces the first open-source, end-to-end unified vision-language-speech model. The method comprises three core innovations: (1) a semantic-acoustic disentangled speech tokenizer enabling high-fidelity speech modeling; (2) an omni-modal alignment training framework in which joint tri-modal learning improves both vision-language and speech performance over bi-modal baselines; and (3) a lightweight style module supporting fine-grained control over emotion and pitch. Through multi-stage joint fine-tuning, the model achieves state-of-the-art results on vision-language benchmarks (VQAv2, OK-VQA) and speech benchmarks (LibriSpeech, CommonVoice). Notably, it is the first open model to realize high-fidelity, emotionally expressive, end-to-end multimodal spoken dialogue.
📝 Abstract
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models suffer from limited or entirely absent vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant), which equips Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly notice that omni-modal alignment can further enhance vision-language and speech abilities compared with bi-modal aligned counterparts. Moreover, a lightweight style module is introduced for flexible speech style control, including emotions and pitches. For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks, and meanwhile supports omni-modal spoken dialogue with vivid emotions.