🤖 AI Summary
This work addresses a limitation of existing fully multimodal large language models: their implicit Thinker-Talker architectures struggle to accurately capture contextual emotions, often producing distorted affective responses. To overcome this, the authors propose EmoOmni, a unified framework that introduces, for the first time, an explicit Emotional Chain-of-Thought (E-CoT) mechanism. E-CoT enables end-to-end emotional reasoning, from fine-grained multimodal perception to text generation, and explicitly guides the dialogue module by serving as a high-level instruction. Concurrently, the study establishes a real-world multimodal emotional dialogue data pipeline and a dedicated evaluation benchmark, EmoOmniEval. Experimental results demonstrate that EmoOmni-7B achieves emotional dialogue performance on par with Qwen3Omni-30B-A3B-Thinking under identical Talker conditions.
📝 Abstract
The evolution of Omni-Modal Large Language Models (Omni-LLMs) has revolutionized human-computer interaction, enabling unified audio-visual perception and speech response. However, existing Omni-LLMs struggle with complex real-world scenarios, often producing superficial understanding and contextually mismatched emotional responses. This issue is further intensified by the Thinker-Talker architecture of Omni-LLMs, in which the two modules are connected only implicitly through hidden states, causing the loss of emotional details. In this work, we present EmoOmni, a unified framework for accurate understanding and expression in multimodal emotional dialogue. At its core, we introduce the Emotional Chain-of-Thought (E-CoT), which enforces a reasoning chain from fine-grained multimodal perception to textual response. Moreover, we explicitly treat E-CoT as high-level emotional instructions that guide the Talker, enabling accurate emotional expression. Complementing the model, we construct EmoOmniPipe to obtain real-world annotated dialogue data and establish a benchmark, EmoOmniEval, to facilitate systematic assessment of the multimodal emotional dialogue task. Experiments show that EmoOmni-7B achieves performance comparable to Qwen3Omni-30B-A3B-Thinking with the same Talker.