🤖 AI Summary
This work addresses the gradient conflict arising from the tight coupling of semantic reasoning and motion modeling in egocentric vision–language conditioned 3D human motion generation. Inspired by the biological decoupling of cognitive and motor control, we propose EgoMotion, a two-stage hierarchical generative framework. In the first stage, a vision–language model maps multimodal inputs into discrete motion primitives, yielding a goal-consistent semantic representation. In the second stage, this representation conditions a diffusion model that iteratively denoises in a continuous latent space to produce physically plausible and temporally coherent 3D motion sequences. By explicitly disentangling semantic inference from motion synthesis, our approach mitigates the optimization conflict between the two objectives and achieves state-of-the-art performance, outperforming existing methods in both semantic alignment and motion fidelity.
📝 Abstract
Faithfully modeling human behavior in dynamic environments is a foundational challenge for embodied intelligence. While conditional motion synthesis has advanced significantly, egocentric motion generation remains largely underexplored due to the inherent complexity of first-person perception. In this work, we investigate Egocentric Vision-Language (Ego-VL) motion generation, a task that requires synthesizing 3D human motion conditioned jointly on first-person visual observations and natural language instructions. We identify a critical reasoning-generation entanglement challenge: optimizing semantic reasoning and kinematic modeling simultaneously introduces gradient conflicts, which systematically degrade both the fidelity of multimodal grounding and the quality of the generated motion. To address this challenge, we propose EgoMotion, a hierarchical generative framework. Inspired by the biological decoupling of cognitive reasoning and motor control, EgoMotion operates in two stages. In the Cognitive Reasoning stage, a vision-language model (VLM) projects multimodal inputs into a structured space of discrete motion primitives. This forces the VLM to acquire goal-consistent representations, bridging the semantic gap between high-level perceptual understanding and low-level action execution. In the Motion Generation stage, these learned representations serve as conditioning signals for a diffusion-based motion generator. By performing iterative denoising within a continuous latent space, the generator synthesizes physically plausible and temporally coherent trajectories. Extensive evaluations demonstrate that EgoMotion achieves state-of-the-art performance and produces motion sequences that are both semantically grounded and kinematically superior to those of existing approaches.
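The two-stage data flow described above can be illustrated with a minimal toy sketch. This is not EgoMotion's implementation: nearest-neighbor quantization against a fixed codebook stands in for the VLM's projection onto discrete motion primitives, and a hand-written `toy_denoiser` stands in for the trained diffusion network. All function names, the codebook, and the update rule are hypothetical and chosen only so that the example runs.

```python
import numpy as np


def quantize_to_primitives(features, codebook):
    """Stage 1 stand-in: assign each feature vector to its nearest codebook entry.

    features: (T, D) array, a placeholder for fused vision-language features.
    codebook: (K, D) array, a placeholder for K discrete motion primitives.
    Returns a (T,) array of primitive indices.
    """
    diffs = features[:, None, :] - codebook[None, :, :]  # (T, K, D)
    dists = (diffs ** 2).sum(axis=-1)                    # (T, K)
    return dists.argmin(axis=1)


def toy_denoiser(z, t, cond):
    """Hypothetical noise predictor; a real system would use a trained network.

    It simply treats the offset from the conditioning signal as noise.
    The step index t is accepted for interface parity but unused here.
    """
    return z - cond


def generate_motion(primitive_ids, codebook, num_steps=50, seed=0):
    """Stage 2 stand-in: iterative denoising in a continuous latent space."""
    rng = np.random.default_rng(seed)
    cond = codebook[primitive_ids]        # (T, D) conditioning signal
    z = rng.standard_normal(cond.shape)   # start from Gaussian noise
    for t in range(num_steps, 0, -1):
        eps_hat = toy_denoiser(z, t, cond)
        z = z - eps_hat / t               # remove a 1/t share of predicted noise
    return z


if __name__ == "__main__":
    codebook = np.eye(3)                  # three toy primitives in a 3-D latent
    features = np.array([[0.9, 0.1, 0.0],
                         [0.0, 0.2, 0.8]])
    ids = quantize_to_primitives(features, codebook)
    motion = generate_motion(ids, codebook, num_steps=10)
    print(ids, motion.shape)
```

The point of the sketch is the interface between the stages: the generator sees only the discrete primitive indices, never the raw visual or language input. In a trained system this separation would mean the reasoning loss and the generation loss no longer backpropagate through the same parameters, which is the decoupling the abstract argues for.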