Nano-EmoX: Unifying Multimodal Emotional Intelligence from Perception to Empathy

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the fragmentation and limited generalization of affective capabilities in existing multimodal language models, which stem from a disconnect between low-level perception and high-level interaction. To bridge this gap, the authors propose a cognition-inspired three-tiered emotional task framework encompassing perception, understanding, and interaction, and introduce Nano-EmoX, a lightweight multitask model trained via a progressive P2E (Perception-to-Empathy) framework. For the first time, this approach unifies six core affective tasks within a 2.2B-parameter model. Key innovations include an enhanced facial encoder, a multimodal fusion module, heterogeneous adapters, and a chain-of-thought-driven curriculum learning strategy, enabling unified modeling from perception to empathy. Experiments demonstrate state-of-the-art or highly competitive performance across multiple benchmarks, with significant gains in both efficiency and cross-task generalization.
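The paper itself does not publish training code here, but the curriculum idea (tasks ordered by cognitive depth, perception first and empathy last) can be sketched as a simple stage schedule. The task names and stage groupings below are illustrative assumptions, not the paper's actual task list:

```python
# Toy sketch of a perception-to-empathy curriculum schedule.
# Stage names follow the paper's three-tier hierarchy; the task
# names inside each stage are hypothetical examples.
CURRICULUM = [
    ("perception",    ["facial expression recognition", "speech emotion recognition"]),
    ("understanding", ["multimodal sentiment analysis", "emotion cause reasoning"]),
    ("interaction",   ["empathetic response generation", "emotional dialogue"]),
]

def training_order():
    """Yield (stage, task) pairs in order of increasing cognitive depth."""
    for stage, tasks in CURRICULUM:
        for task in tasks:
            yield stage, task

stages = [stage for stage, _task in training_order()]
print(stages[0], stages[-1])  # perception interaction
```

The point of the schedule is that interaction-level (chain-of-thought empathy) examples are only introduced after the model has been trained on the lower perception and understanding tiers.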

📝 Abstract
The development of affective multimodal language models (MLMs) has long been constrained by a gap between low-level perception and high-level interaction, leading to fragmented affective capabilities and limited generalization. To bridge this gap, we propose a cognitively inspired three-level hierarchy that organizes affective tasks according to their cognitive depth (perception, understanding, and interaction) and provides a unified conceptual foundation for advancing affective modeling. Guided by this hierarchy, we introduce Nano-EmoX, a small-scale multitask MLM, and P2E (Perception-to-Empathy), a curriculum-based training framework. Nano-EmoX integrates a suite of omni-modal encoders, including an enhanced facial encoder and a fusion encoder, to capture key multimodal affective cues and improve cross-task transferability. The outputs are projected into a unified language space via heterogeneous adapters, empowering a lightweight language model to tackle diverse affective tasks. Concurrently, P2E progressively cultivates emotional intelligence by aligning rapid perception with chain-of-thought-driven empathy. To the best of our knowledge, Nano-EmoX is the first compact MLM (2.2B) to unify six core affective tasks across all three hierarchy levels, achieving state-of-the-art or highly competitive performance across multiple benchmarks while demonstrating excellent efficiency and generalization.
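The abstract's "heterogeneous adapters" describe per-modality projections that map differently shaped encoder outputs into one shared language-model embedding space. A minimal NumPy sketch of that idea follows; all dimensions, modality names, and the plain linear form of the adapters are assumptions for illustration, not details from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical feature sizes -- not taken from the paper.
D_FACE, D_AUDIO, D_FUSED = 512, 256, 768  # per-modality encoder output dims
D_LM = 2048                               # language-model hidden size

# One adapter per modality: "heterogeneous" because each maps a
# differently shaped encoder output into the same LM space.
adapters = {
    "face":  rng.standard_normal((D_FACE,  D_LM)) * 0.02,
    "audio": rng.standard_normal((D_AUDIO, D_LM)) * 0.02,
    "fused": rng.standard_normal((D_FUSED, D_LM)) * 0.02,
}

def project(modality: str, feats: np.ndarray) -> np.ndarray:
    """Map (num_tokens, d_modality) encoder features into the LM space."""
    return feats @ adapters[modality]

# Each encoder emits its own token sequence; after projection they can be
# concatenated with text embeddings and fed to the lightweight LM.
face_tokens = project("face", rng.standard_normal((16, D_FACE)))
audio_tokens = project("audio", rng.standard_normal((8, D_AUDIO)))
prefix = np.concatenate([face_tokens, audio_tokens], axis=0)
print(prefix.shape)  # (24, 2048)
```

In practice such adapters are usually small learned MLPs rather than fixed random matrices; the sketch only shows the shape bookkeeping that lets one lightweight LM consume all modalities through a single token interface.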
Problem

Research questions and friction points this paper is trying to address.

affective multimodal language models
perception-interaction gap
emotional intelligence
multimodal affective modeling
task generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

multimodal language model
emotional intelligence
perception-to-empathy
cognitive hierarchy
heterogeneous adapters
Jiahao Huang
Fujian Normal University
Fengyan Lin
Fujian Normal University
Xuechao Yang
School of Computing Technologies, RMIT University
Feng Chen
Kexin Zhu
Xu Yang
Minjiang University
Zhide Chen
Fujian Normal University; Fudan University