EMOVA: Empowering Language Models to See, Hear and Speak with Vivid Emotions

📅 2024-09-26
🏛️ arXiv.org
📈 Citations: 19
Influential: 3
🤖 AI Summary
Current open-source multimodal models suffer from modality fragmentation: vision-language models (VLMs) lack end-to-end speech generation, while speech-language models (SLMs) lack visual understanding. This work introduces the first open-source, end-to-end unified vision-language-speech model. The method comprises three core components: (1) a semantic-acoustic disentangled speech tokenizer enabling high-fidelity speech modeling; (2) an omni-modal alignment training framework in which joint vision-language-speech learning improves both vision-language and speech performance over bi-modally aligned counterparts; and (3) a lightweight style module supporting fine-grained control over emotion and pitch. Through multi-stage joint fine-tuning, the model achieves state-of-the-art results on vision-language benchmarks (VQAv2, OK-VQA) and speech benchmarks (LibriSpeech, CommonVoice). Notably, it is the first open model to support high-fidelity, emotionally expressive, end-to-end multimodal spoken dialogue.
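For intuition on how these pieces could fit together, below is a minimal, hypothetical PyTorch sketch of a semantic-acoustic disentangled speech tokenizer feeding discrete speech tokens, alongside vision and text tokens, into a shared backbone. It is not the authors' released code; every class name, dimension, and codebook size is an illustrative assumption.

```python
# Illustrative only: module names, shapes, and codebook sizes are assumptions,
# not EMOVA's actual implementation.
import torch
import torch.nn as nn


class DisentangledSpeechTokenizer(nn.Module):
    """Maps speech features to two discrete streams: frame-level semantic
    tokens (what is said) and a pooled style code (how it is said)."""

    def __init__(self, feat_dim=80, hidden=256, semantic_codes=4096, style_codes=64):
        super().__init__()
        self.encoder = nn.GRU(feat_dim, hidden, batch_first=True)
        self.semantic_codebook = nn.Parameter(torch.randn(semantic_codes, hidden))
        self.style_codebook = nn.Parameter(torch.randn(style_codes, hidden))

    def forward(self, mel):                                   # mel: (B, T, feat_dim)
        B, T, _ = mel.shape
        h, _ = self.encoder(mel)                              # (B, T, hidden)
        # Nearest-neighbour quantization: per-frame semantic ids ...
        sem_ids = torch.cdist(h.reshape(B * T, -1), self.semantic_codebook).argmin(-1).view(B, T)
        # ... and one utterance-level style id from a time-pooled summary.
        style_ids = torch.cdist(h.mean(dim=1), self.style_codebook).argmin(-1)  # (B,)
        return sem_ids, style_ids


class OmniModalBackbone(nn.Module):
    """Toy stand-in for the LLM: consumes projected vision features, text
    tokens, and semantic speech tokens as a single sequence."""

    def __init__(self, d_model=512, vocab=32000, speech_codes=4096, image_dim=1024):
        super().__init__()
        self.text_embed = nn.Embedding(vocab, d_model)
        self.speech_embed = nn.Embedding(speech_codes, d_model)
        self.vision_proj = nn.Linear(image_dim, d_model)      # projector for vision-encoder patches
        layer = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=2)

    def forward(self, image_feats, text_ids, speech_ids):
        seq = torch.cat(
            [self.vision_proj(image_feats),                   # (B, P, d_model)
             self.text_embed(text_ids),                       # (B, L, d_model)
             self.speech_embed(speech_ids)],                  # (B, T, d_model)
            dim=1,
        )
        return self.blocks(seq)                               # contextual states over all modalities


# Usage with random placeholder inputs.
tokenizer = DisentangledSpeechTokenizer()
backbone = OmniModalBackbone()
mel = torch.randn(2, 120, 80)                                 # fake log-mel spectrogram
sem_ids, style_ids = tokenizer(mel)
image_feats = torch.randn(2, 16, 1024)                        # fake vision-encoder output
text_ids = torch.randint(0, 32000, (2, 10))
states = backbone(image_feats, text_ids, sem_ids)
print(states.shape, style_ids.shape)                          # torch.Size([2, 146, 512]) torch.Size([2])
```

The design choice sketched here mirrors the paper's description of disentanglement: semantic tokens carry the content that the language backbone reasons over, while the pooled style code captures delivery, keeping the two concerns separate.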

📝 Abstract
GPT-4o, an omni-modal model that enables vocal conversations with diverse emotions and tones, marks a milestone for omni-modal foundation models. However, empowering Large Language Models to perceive and generate images, text, and speech end-to-end with publicly available data remains challenging for the open-source community. Existing vision-language models rely on external tools for speech processing, while speech-language models still suffer from limited or entirely absent vision-understanding capabilities. To address this gap, we propose EMOVA (EMotionally Omni-present Voice Assistant) to equip Large Language Models with end-to-end speech capabilities while maintaining leading vision-language performance. With a semantic-acoustic disentangled speech tokenizer, we surprisingly find that omni-modal alignment can further enhance vision-language and speech abilities compared with bi-modally aligned counterparts. Moreover, a lightweight style module is introduced for flexible speech style control, including emotions and pitches. For the first time, EMOVA achieves state-of-the-art performance on both vision-language and speech benchmarks while supporting omni-modal spoken dialogue with vivid emotions.
Problem

Research questions and friction points this paper is trying to address.

Enable LLMs to perceive and generate images, text, and speech end-to-end.
Enhance vision-language and speech abilities through omni-modal alignment.
Achieve state-of-the-art performance on both vision-language and speech benchmarks.
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end speech abilities for LLMs
Semantic-acoustic disentangled speech tokenizer
Lightweight style module for speech style control (emotion, pitch); see the illustrative sketch below
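The sketch below illustrates, under assumed label sets and tensor shapes, how such a lightweight style module could condition a speech-token decoder on emotion and pitch choices. It is a hypothetical PyTorch illustration of the idea, not the authors' implementation; every name and dimension is an assumption.

```python
# Hedged sketch of a lightweight style module in the spirit of the paper's
# description (fine-grained control over emotion and pitch). Label sets,
# conditioning mechanism, and decoder are illustrative assumptions.
import torch
import torch.nn as nn

EMOTIONS = ["neutral", "happy", "sad", "angry"]   # hypothetical label set
PITCHES = ["low", "normal", "high"]


class StyleModule(nn.Module):
    """Turns discrete (emotion, pitch) choices into a style vector that
    conditions speech generation without touching the LLM weights."""

    def __init__(self, d_model=512):
        super().__init__()
        self.emotion_embed = nn.Embedding(len(EMOTIONS), d_model)
        self.pitch_embed = nn.Embedding(len(PITCHES), d_model)
        self.fuse = nn.Linear(2 * d_model, d_model)

    def forward(self, emotion_id, pitch_id):
        style = torch.cat([self.emotion_embed(emotion_id), self.pitch_embed(pitch_id)], dim=-1)
        return self.fuse(style)                               # (B, d_model) style vector


class SpeechTokenDecoder(nn.Module):
    """Predicts speech units from LLM hidden states, shifted by the style
    vector so the same content can be rendered in different styles."""

    def __init__(self, d_model=512, speech_codes=4096):
        super().__init__()
        self.head = nn.Linear(d_model, speech_codes)

    def forward(self, hidden_states, style_vec):
        conditioned = hidden_states + style_vec.unsqueeze(1)  # broadcast over time steps
        return self.head(conditioned)                         # (B, T, speech_codes) logits


# Usage: render the same hidden states as "happy" speech with "high" pitch.
style_module = StyleModule()
decoder = SpeechTokenDecoder()
hidden = torch.randn(1, 50, 512)                              # fake LLM output states
style = style_module(torch.tensor([EMOTIONS.index("happy")]),
                     torch.tensor([PITCHES.index("high")]))
logits = decoder(hidden, style)
print(logits.shape)                                           # torch.Size([1, 50, 4096])
```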
👥 Authors
Kai Chen (Hong Kong University of Science and Technology)
Yunhao Gou (Hong Kong University of Science and Technology; Southern University of Science and Technology)
Runhui Huang (The University of Hong Kong; Sun Yat-sen University)
Zhili Liu (Beike): SLAM, DL, HPC, Computer Graphics
Daxin Tan (Huawei Noah's Ark Lab)
Jing Xu (The Chinese University of Hong Kong)
Chunwei Wang (Researcher, Huawei Noah's Ark Lab): Computer Vision, Autonomous Driving, Multimodality
Yi Zhu (Huawei Noah's Ark Lab)
Yihan Zeng (Huawei Noah's Ark Lab)
Kuo Yang (Huawei Noah's Ark Lab)
Dingdong Wang (The Chinese University of Hong Kong)
Kun Xiang (School of Intelligent Systems Engineering, Sun Yat-Sen University): Symbolic Reasoning, Medical Imaging Process
Haoyuan Li (Sun Yat-sen University)
Haoli Bai (Huawei Technologies): natural language processing, model compression
Jianhua Han (2030 Research, YinWang, Huawei): Vision Language Model, Foundation Model, VLA
Xiaohui Li (Huawei Noah's Ark Lab)
Weike Jin (Zhejiang University): multi-modal learning, computer vision, deep learning
Nian Xie (Huawei Noah's Ark Lab)
Yu Zhang (Southern University of Science and Technology)
James T. Kwok (Professor of Computer Science and Engineering, Hong Kong University of Science and Technology): Machine learning
Hengshuang Zhao (The University of Hong Kong): Computer Vision, Machine Learning, Artificial Intelligence
Xiaodan Liang (Professor of Computer Science, Sun Yat-sen University, MBZUAI, CMU, NUS): Computer vision, Embodied AI, Machine learning
Dit-Yan Yeung (Chair Professor, Department of CSE, HKUST, Hong Kong): Machine Learning, Artificial Intelligence, Computer Vision
Xiao Chen (Huawei Noah's Ark Lab)
Zhenguo Li (Huawei Noah's Ark Lab, Columbia, CUHK, PKU): machine learning, generative AI, AI for mathematics
Wei Zhang (Huawei Noah's Ark Lab)
Qun Liu (Huawei Noah's Ark Lab)
Lanqing Hong (Huawei Noah's Ark Lab)
Lu Hou (Huawei Noah's Ark Lab)
Hang Xu (Huawei Noah's Ark Lab)