Towards Interactive Intelligence for Digital Humans

📅 2025-12-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the critical limitations of digital humans—lacking personality consistency, interactive adaptability, and self-evolutionary capability—by proposing Mio, an end-to-end multimodal interactive framework. Methodologically, it pioneers the "interactive intelligence" paradigm, introducing the Omni-Avatar architecture comprising five synergistic modules (Thinker, Talker, Face Animator, Body Animator, and Renderer) that together cover multimodal large-model collaborative reasoning, personality-aligned controllable generation, real-time speech–expression–gesture co-driven animation, and neural rendering. Key contributions include: (1) the first comprehensive benchmark specifically designed for evaluating interactive intelligence in digital humans; and (2) state-of-the-art performance on this benchmark, achieving significant improvements in facial expression naturalness, dialogue coherence, motion coordination, personality consistency, and evolutionary capability.

📝 Abstract
We introduce Interactive Intelligence, a novel paradigm of digital human that is capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework achieves superior performance compared to state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.
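The abstract's five-module decomposition (Thinker → Talker → Face Animator / Body Animator → Renderer) can be sketched as a minimal dataflow. This is an illustrative assumption only: every class, method, and string placeholder below is hypothetical and does not reflect the authors' actual interfaces, only the module ordering the abstract describes.

```python
from dataclasses import dataclass

@dataclass
class Response:
    text: str   # dialogue reply from the Thinker
    audio: str  # synthesized speech from the Talker
    face: str   # facial animation stream
    body: str   # gesture/body animation stream
    frame: str  # final rendered output

class MioPipeline:
    """Hypothetical sketch of the five-module Omni-Avatar dataflow."""

    def think(self, user_input: str) -> str:
        # Thinker: multimodal reasoning produces a personality-aligned reply.
        return f"reply({user_input})"

    def talk(self, text: str) -> str:
        # Talker: converts the reply into speech audio.
        return f"speech({text})"

    def animate_face(self, audio: str) -> str:
        # Face Animator: drives lip sync and expression from the audio.
        return f"face({audio})"

    def animate_body(self, text: str) -> str:
        # Body Animator: generates co-speech gestures from the reply.
        return f"gestures({text})"

    def render(self, face: str, body: str) -> str:
        # Renderer: composes both animation streams into an output frame.
        return f"frame({face}+{body})"

    def respond(self, user_input: str) -> Response:
        text = self.think(user_input)
        audio = self.talk(text)
        face = self.animate_face(audio)
        body = self.animate_body(text)
        frame = self.render(face, body)
        return Response(text, audio, face, body, frame)

out = MioPipeline().respond("hello")
print(out.frame)  # the rendered frame depends on both animation streams
```

The sketch makes one structural point from the abstract concrete: cognitive reasoning (Thinker) feeds both the speech and embodiment paths, which reconverge only at the Renderer, so dialogue, face, and body stay consistent by construction.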
Problem

Research questions and friction points this paper is trying to address.

Develops interactive digital humans with personality and adaptability
Creates an end-to-end framework for multimodal avatar integration
Establishes a benchmark to evaluate interactive intelligence capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

End-to-end multimodal framework for digital humans
Integrated cognitive reasoning with real-time embodiment
New benchmark for evaluating interactive intelligence capabilities
Authors

Yiyi Cai — California Institute of Technology (Quantum Information Theory)
Xuangeng Chu — The University of Tokyo (3D Computer Vision, Virtual Humans, Digital Humans)
Xiwei Gao — Shanda AI Research Tokyo
Sitong Gong — Shanda AI Research Tokyo
Yifei Huang — Shanda AI Research Tokyo, The University of Tokyo
Caixin Kang — The University of Tokyo (Computer Vision, Trustworthy AI, Autonomous Driving, Generative Models)
Kunhang Li — Shanda AI Research Tokyo, The University of Tokyo
Haiyang Liu — The University of Tokyo (Human Video Generation, Motion Generation, Multi-Modal Understanding and Generation)
Ruicong Liu — The University of Tokyo (Computer Vision)
Yun Liu — Shanda AI Research Tokyo, National Institute of Informatics
Dianwen Ng — MiroMind, Alibaba-NTU Singapore Joint Research Institute (Artificial Intelligence, Deep Learning, Speech Recognition, Self-supervised Learning)
Zixiong Su — The University of Tokyo (Human-Computer Interaction, Silent Speech Interface, Human-AI Interaction)
Erwin Wu — Tokyo Institute of Technology (Computer Vision, Human-Computer Interaction)
Yuhan Wu — Peking University (Data Structures, Networking, Big Data)
Dingkun Yan — Shanda AI Research Tokyo
Tianyu Yan — Shanda AI Research Tokyo
Chang Zeng — National Institute of Informatics (Speech Processing, Speech/Singing Synthesis, Audio/Music Generation, Speaker Recognition)
Bo Zheng — Shanda AI Research Tokyo
You Zhou — Shanda AI Research Tokyo