Unified Multi-Modal Interactive & Reactive 3D Motion Generation via Rectified Flow

📅 2025-09-28
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses multi-modal two-person 3D interactive motion generation. Methodologically, it introduces the first unified and efficient generative framework that jointly conditions on text, music, and prior motion sequences. The method employs a rectified flow model for deterministic, low-latency synthesis; integrates a retrieval-augmented generation (RAG) module with contrastive learning to strengthen cross-modal semantic alignment; and proposes a synchronization-aware loss that explicitly models temporal coordination in dyadic interaction. It further incorporates an LLM-driven text parser and a music feature extractor to improve the fidelity of conditional control. Experiments demonstrate state-of-the-art performance in motion quality, rhythm synchronization, multi-modal consistency, and inference efficiency, significantly outperforming diffusion-based approaches and enabling high-fidelity, low-latency two-person motion synthesis.
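To make the rectified-flow sampling step concrete, here is a minimal sketch of few-step Euler integration along the learned (near-)straight path from noise to motion. The velocity network `v_theta` and its call signature are assumptions for illustration, not the authors' released code:

```python
import torch

@torch.no_grad()
def sample_rectified_flow(v_theta, cond, shape, num_steps=8, device="cpu"):
    """Integrate dx/dt = v_theta(x, t, cond) from t=0 (noise) to t=1 (motion)."""
    x = torch.randn(shape, device=device)      # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = torch.full((shape[0],), i * dt, device=device)
        # Euler step; rectified flow's straight-line paths let a handful of
        # steps replace the long stochastic chains of diffusion samplers.
        x = x + dt * v_theta(x, t, cond)
    return x                                   # generated two-person motion tensor
```

Because the transport paths are straightened, even `num_steps=8` can suffice, which is the source of the low-latency claim relative to diffusion-based samplers.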

📝 Abstract
Generating realistic, context-aware two-person motion conditioned on diverse modalities remains a central challenge in computer graphics, animation, and human-computer interaction. We introduce DualFlow, a unified and efficient framework for multi-modal two-person motion generation. DualFlow conditions 3D motion synthesis on diverse inputs, including text, music, and prior motion sequences. Leveraging rectified flow, it achieves deterministic straight-line sampling paths between noise and data, reducing inference time and mitigating the error accumulation common in diffusion-based models. To enhance semantic grounding, DualFlow employs a Retrieval-Augmented Generation (RAG) module that retrieves motion exemplars using music features and LLM-based text decompositions of spatial relations, body movements, and rhythmic patterns. A contrastive objective further strengthens alignment with the conditioning signals, and a synchronization loss improves inter-person coordination. Extensive evaluations across text-to-motion, music-to-motion, and multi-modal interactive benchmarks show consistent gains in motion quality, responsiveness, and efficiency. DualFlow produces temporally coherent and rhythmically synchronized motions, setting a new state of the art in multi-modal human motion generation.
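As an illustration of how a RAG module of this kind can retrieve motion exemplars from music features and LLM-parsed text attributes, the sketch below matches a fused condition embedding against an exemplar bank by cosine similarity. All names and shapes here are hypothetical, not the paper's published interface:

```python
import torch
import torch.nn.functional as F

def retrieve_exemplars(cond_emb, exemplar_bank, k=4):
    """cond_emb: (d,) query embedding fused from music and text features;
    exemplar_bank: (N, d) precomputed motion-exemplar embeddings."""
    q = F.normalize(cond_emb, dim=-1)
    bank = F.normalize(exemplar_bank, dim=-1)
    sims = bank @ q                      # (N,) cosine similarities
    topk = sims.topk(k)
    return topk.indices, topk.values     # ids and scores of the k nearest exemplars
```

The retrieved exemplars would then be fed to the generator as additional conditioning, grounding the synthesized motion in semantically similar reference clips.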
Problem

Research questions and friction points this paper is trying to address.

Generating realistic two-person motion from diverse modalities
Achieving efficient motion synthesis with reduced inference time
Enhancing semantic grounding through retrieval-augmented generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rectified flow enables deterministic straight-line motion sampling
Retrieval-Augmented Generation module enhances semantic grounding
Contrastive objective and synchronization loss improve motion alignment (sketched below)
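The two auxiliary objectives named above can be sketched as follows. The InfoNCE-style contrastive loss is a standard formulation; the synchronization term shown here, which matches the frame-to-frame velocity profiles of the two performers, is only one plausible proxy for the paper's synchronization loss, and all names are illustrative:

```python
import torch
import torch.nn.functional as F

def contrastive_loss(motion_emb, cond_emb, temperature=0.07):
    """motion_emb, cond_emb: (B, d) paired embeddings; standard InfoNCE
    pulling each motion toward its own conditioning signal."""
    m = F.normalize(motion_emb, dim=-1)
    c = F.normalize(cond_emb, dim=-1)
    logits = m @ c.t() / temperature                  # (B, B) similarity matrix
    targets = torch.arange(m.size(0), device=m.device)
    return F.cross_entropy(logits, targets)

def synchronization_loss(motion_a, motion_b):
    """motion_a, motion_b: (B, T, d) per-person motion sequences. Penalizing
    mismatched per-frame speed keeps the dyad temporally coordinated."""
    vel_a = motion_a[:, 1:] - motion_a[:, :-1]        # finite-difference velocities
    vel_b = motion_b[:, 1:] - motion_b[:, :-1]
    return F.mse_loss(vel_a.norm(dim=-1), vel_b.norm(dim=-1))
```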