InterMoE: Individual-Specific 3D Human Interaction Generation via Dynamic Temporal-Selective MoE

📅 2025-11-17
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for generating personalized 3D human interaction motions struggle to simultaneously preserve identity fidelity and ensure precise text–motion semantic alignment. To address this, the authors propose InterMoE, a framework built on a Dynamic Temporal-Selective Mixture-of-Experts architecture that integrates text-guided semantic modeling with temporal decomposition of motion features. Its routing mechanism adaptively focuses on salient keyframes, enabling specialized experts to collaboratively model high-level text semantics and low-level motion context. This design preserves identity consistency while improving text–motion alignment. Evaluated on the InterHuman and InterX benchmarks, the method reduces Fréchet Inception Distance (FID) by 9% and 22%, respectively, establishing new state-of-the-art performance for individual-specific 3D interactive motion generation.

📝 Abstract
Generating high-quality human interactions holds significant value for applications like virtual reality and robotics. However, existing methods often fail to preserve unique individual characteristics or fully adhere to textual descriptions. To address these challenges, we introduce InterMoE, a novel framework built on a Dynamic Temporal-Selective Mixture of Experts. The core of InterMoE is a routing mechanism that synergistically uses both high-level text semantics and low-level motion context to dispatch temporal motion features to specialized experts. This allows experts to dynamically determine their selection capacity and focus on critical temporal features, thereby preserving specific individual identities while ensuring high semantic fidelity. Extensive experiments show that InterMoE achieves state-of-the-art performance in individual-specific, high-fidelity 3D human interaction generation, reducing FID scores by 9% on the InterHuman dataset and 22% on InterX.
Problem

Research questions and friction points this paper is trying to address.

Generated 3D human interactions lose the unique characteristics of specific individuals
Generated motions often deviate from their textual descriptions (weak semantic fidelity)
Uniform temporal treatment of motion features limits quality on critical keyframes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic Temporal-Selective Mixture of Experts framework
Routing mechanism using text semantics and motion context
Specialized experts focusing on critical temporal features
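The routing idea above can be sketched in a few lines: a gate scores each temporal motion token for every expert, conditioned on both the token itself (low-level motion context) and a global text embedding (high-level semantics), then dispatches each token to its top-scoring experts. This is a minimal NumPy illustration under stated assumptions, not the paper's implementation; in particular, the fixed `top_k` here stands in for the paper's dynamically determined selection capacity, and all names and dimensions are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class TemporalSelectiveMoE:
    """Toy temporal-selective MoE layer (illustrative sketch only).

    Each temporal motion token is routed by a gate that sees the
    concatenation [motion token ; text embedding], so dispatch depends
    on both motion context and text semantics.
    """

    def __init__(self, d_model, n_experts, top_k=2):
        self.top_k = top_k
        # Gate projects [motion ; text] (2*d_model) to per-expert logits.
        self.w_gate = rng.normal(0.0, 0.02, (2 * d_model, n_experts))
        # Each expert is a single linear map (stand-in for a full FFN).
        self.experts = rng.normal(0.0, 0.02, (n_experts, d_model, d_model))

    def __call__(self, motion, text):
        # motion: (T, d_model) temporal features; text: (d_model,) embedding.
        T, _ = motion.shape
        gate_in = np.concatenate([motion, np.tile(text, (T, 1))], axis=1)
        logits = gate_in @ self.w_gate                      # (T, n_experts)
        # "Selective" dispatch: keep only the top-k experts per timestep.
        topk = np.argsort(logits, axis=1)[:, -self.top_k:]
        masked = np.full_like(logits, -np.inf)
        np.put_along_axis(masked, topk,
                          np.take_along_axis(logits, topk, axis=1), axis=1)
        weights = softmax(masked, axis=1)   # zero weight off the top-k
        # Combine expert outputs, weighted by the gate.
        out = np.zeros_like(motion)
        for e in range(self.experts.shape[0]):
            out += weights[:, e:e + 1] * (motion @ self.experts[e])
        return out, weights
```

A real implementation would replace the linear experts with feed-forward blocks inside a diffusion/transformer backbone and learn the gate end to end; the sketch only shows how text and motion jointly drive temporal expert selection.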
Lipeng Wang
School of Software, Beihang University
Hongxing Fan
School of Computer Science and Engineering, Beihang University
Haohua Chen
School of Software, Beihang University
Zehuan Huang
Beihang University
Generative Model · Computer Vision
Lu Sheng
School of Software, Beihang University
Embodied AI · 3D Vision · Machine Learning