🤖 AI Summary
This paper addresses critical limitations of digital humans—lacking personality consistency, interactive adaptability, and self-evolutionary capability—by proposing Mio, an end-to-end multimodal interactive framework. Methodologically, it pioneers the “interactive intelligence” paradigm, introducing the Omni-Avatar architecture comprising five synergistic modules: multimodal large-model collaborative reasoning (Thinker), personality-aligned controllable generation (Talker), real-time speech-driven facial animation (Face Animator), gesture co-driven body animation (Body Animator), and neural rendering (Renderer). Key contributions include: (1) the first comprehensive benchmark specifically designed for evaluating interactive intelligence in digital humans; and (2) state-of-the-art performance on this benchmark, with significant improvements in facial expression naturalness, dialogue coherence, motion coordination, personality consistency, and evolutionary capability.
📝 Abstract
We introduce Interactive Intelligence, a novel paradigm for digital humans capable of personality-aligned expression, adaptive interaction, and self-evolution. To realize this, we present Mio (Multimodal Interactive Omni-Avatar), an end-to-end framework composed of five specialized modules: Thinker, Talker, Face Animator, Body Animator, and Renderer. This unified architecture integrates cognitive reasoning with real-time multimodal embodiment to enable fluid, consistent interaction. Furthermore, we establish a new benchmark to rigorously evaluate the capabilities of interactive intelligence. Extensive experiments demonstrate that our framework outperforms state-of-the-art methods across all evaluated dimensions. Together, these contributions move digital humans beyond superficial imitation toward intelligent interaction.