U-Mind: A Unified Framework for Real-Time Multimodal Interaction with Audiovisual Generation

📅 2026-02-27
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
Existing systems struggle to support natural, coherent embodied-agent interaction in real-time, full-stack multimodal settings due to single-modality processing, weak cross-modal alignment, and degraded reasoning. This work proposes U-Mind, a unified framework that jointly generates language, speech, motion, and video within a single interactive loop. U-Mind introduces a segment-wise alignment strategy and a rehearsal-driven learning mechanism that together maintain cross-modal synchronization and preserve reasoning fidelity, and it further integrates text-first decoding with real-time video rendering conditioned on pose and speech. Evaluated on question answering, instruction following, and motion generation, the approach achieves state-of-the-art performance, significantly improving interaction coherence, temporal synchrony, and expressive richness.
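To make the text-first decoding idea concrete, here is a minimal Python sketch of such a loop: a textual plan is produced first, and each planned text segment then drives stub speech and motion decoders so all modalities share that segment's timing. Every name, the timing heuristic, and the frame rate are illustrative assumptions, not the paper's actual interface.

```python
# Hypothetical sketch of a text-first decoding loop: plan in text first,
# then emit the other modalities segment by segment. All names and the
# timing heuristics are illustrative, not taken from the paper.

from dataclasses import dataclass, field

@dataclass
class Segment:
    """One temporally aligned chunk of multimodal output."""
    text: str
    speech_frames: list = field(default_factory=list)  # e.g. audio codec tokens
    motion_frames: list = field(default_factory=list)  # e.g. pose keypoints
    duration_s: float = 0.0

def plan_response(user_input: str) -> list[str]:
    """Stand-in for internal chain-of-thought planning: returns the text
    segments to verbalize, before any audio/visual decoding happens."""
    return [s.strip() + "." for s in user_input.split(".") if s.strip()]

def decode_segment(text: str) -> Segment:
    """Stand-in for the per-segment decoders. A real system would call
    speech, motion, and video generators conditioned on this text span."""
    duration = 0.06 * len(text)            # rough speaking-rate proxy
    n_frames = max(1, int(duration * 25))  # 25 fps placeholder
    return Segment(
        text=text,
        speech_frames=[f"audio[{i}]" for i in range(n_frames)],
        motion_frames=[f"pose[{i}]" for i in range(n_frames)],
        duration_s=duration,
    )

def respond(user_input: str) -> list[Segment]:
    """Text-first pipeline: finish the textual plan, then stream the
    temporally synchronized audiovisual segments one at a time."""
    return [decode_segment(t) for t in plan_response(user_input)]

if __name__ == "__main__":
    for seg in respond("Pick up the cup. Wave to the camera."):
        print(f"{seg.duration_s:4.2f}s | {seg.text} "
              f"({len(seg.speech_frames)} audio / {len(seg.motion_frames)} pose frames)")
```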

📝 Abstract
Full-stack multimodal interaction in real-time is a central goal in building intelligent embodied agents capable of natural, dynamic communication. However, existing systems are either limited to unimodal generation or suffer from degraded reasoning and poor cross-modal alignment, preventing coherent and perceptually grounded interactions. In this work, we introduce U-Mind, the first unified system for high-intelligence multimodal dialogue that supports real-time generation and jointly models language, speech, motion, and video synthesis within a single interactive loop. At its core, U-Mind implements a Unified Alignment and Reasoning Framework that addresses two key challenges: enhancing cross-modal synchronization via a segment-wise alignment strategy, and preserving reasoning abilities through Rehearsal-Driven Learning. During inference, U-Mind adopts a text-first decoding pipeline that performs internal chain-of-thought planning followed by temporally synchronized generation across modalities. To close the loop, we implement a real-time video rendering framework conditioned on pose and speech, enabling expressive and synchronized visual feedback. Extensive experiments demonstrate that U-Mind achieves state-of-the-art performance on a range of multimodal interaction tasks, including question answering, instruction following, and motion generation, paving the way toward intelligent, immersive conversational agents.
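The abstract's segment-wise alignment strategy can be pictured as bucketing timestamped tokens from each modality into shared time windows, so that cross-modal counterparts land in the same segment. The sketch below assumes a fixed window length and a simple (timestamp, token) stream format; the paper's actual segment boundaries and token types are not specified here.

```python
# Hypothetical illustration of segment-wise alignment: tokens from several
# modalities carry timestamps, and a shared boundary (a fixed window here)
# groups them so each segment stays synchronized across modalities. The
# window size and token format are assumptions, not the paper's design.

from collections import defaultdict

def segment_align(streams: dict[str, list[tuple[float, str]]],
                  window_s: float = 0.5) -> list[dict[str, list[str]]]:
    """Bucket (timestamp, token) pairs from each modality into fixed-length
    windows so cross-modal counterparts share a segment index."""
    buckets: dict[int, dict[str, list[str]]] = defaultdict(lambda: defaultdict(list))
    for modality, tokens in streams.items():
        for t, tok in tokens:
            buckets[int(t // window_s)][modality].append(tok)
    return [dict(buckets[i]) for i in sorted(buckets)]

if __name__ == "__main__":
    streams = {
        "text":   [(0.1, "hello"), (0.7, "there")],
        "speech": [(0.0, "a0"), (0.2, "a1"), (0.6, "a2"), (0.8, "a3")],
        "motion": [(0.05, "p0"), (0.55, "p1")],
    }
    for i, seg in enumerate(segment_align(streams)):
        print(f"segment {i}: {seg}")
```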
Problem

Research questions and friction points this paper is trying to address.

multimodal interaction
real-time generation
cross-modal alignment
reasoning degradation
embodied agents
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified Alignment and Reasoning
Rehearsal-Driven Learning
Multimodal Synchronization
Real-Time Audiovisual Generation
Chain-of-Thought Planning