🤖 AI Summary
This work addresses the underexplored challenge of jointly generating egocentric videos and corresponding human body motions, which requires synchronously modeling visual content and the camera motion induced by the wearer's body. To this end, we propose EgoTwin, which introduces a head-anchored motion representation and a cybernetics-inspired interaction mechanism that explicitly captures the causal, bidirectional dependencies between video and motion sequences. Built upon a diffusion transformer architecture, the framework uses cross-modal bidirectional attention to enable end-to-end joint generation of video and motion. For evaluation, we curate a large-scale real-world dataset of synchronized text–video–motion triplets. Quantitative results, assessed via novel viewpoint-alignment and video–motion-consistency metrics, demonstrate significant improvements over existing baselines. Our approach establishes a new paradigm for egocentric generation: interpretable, high-fidelity joint modeling of vision and embodied dynamics.
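To make the joint-generation idea concrete, below is a minimal PyTorch sketch of a single transformer block that applies bidirectional attention over concatenated video and motion token streams. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class and parameter names are hypothetical, and the paper's cybernetics-inspired causal masking between the two streams is omitted for brevity.

```python
# Illustrative sketch (assumed names/shapes): one DiT-style block where video
# and motion tokens attend to each other in a single bidirectional attention pass.
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_m = nn.LayerNorm(d_model)
        # One attention over the concatenated streams lets video tokens attend
        # to motion tokens and vice versa (cross-modal, bidirectional).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, video_tokens: torch.Tensor, motion_tokens: torch.Tensor):
        # video_tokens: (B, T_v, d), motion_tokens: (B, T_m, d)
        x_in = torch.cat([video_tokens, motion_tokens], dim=1)
        x_norm = torch.cat([self.norm_v(video_tokens), self.norm_m(motion_tokens)], dim=1)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm, need_weights=False)
        x = x_in + attn_out
        x = x + self.mlp(self.norm_mlp(x))
        t_v = video_tokens.shape[1]
        return x[:, :t_v], x[:, t_v:]  # split back into the two streams

# Example: 16 video tokens and 32 motion tokens sharing one joint attention pass.
block = JointAttentionBlock()
v, m = torch.randn(2, 16, 256), torch.randn(2, 32, 256)
v_out, m_out = block(v, m)
```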
📝 Abstract
While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored, as it requires modeling first-person view content along with the camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce the novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.
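As a rough illustration of what a head-centric (head-anchored) motion representation could look like, the sketch below re-expresses global joint positions in the per-frame coordinate system of the head joint, so that the head trajectory is exposed as an explicit anchor for the egocentric camera. The function name, array shapes, and the SMPL-style head joint index are assumptions for illustration, not the paper's actual data format.

```python
# Hypothetical sketch of a head-centric motion representation: per frame, joint
# positions are expressed relative to the head joint's pose, making the head
# trajectory (and thus the egocentric camera path) explicit.
import numpy as np

def to_head_centric(joints: np.ndarray, head_rot: np.ndarray, head_idx: int = 15):
    """joints: (T, J, 3) global joint positions; head_rot: (T, 3, 3) head rotations.

    Returns head-frame joint positions (T, J, 3) and the global head trajectory (T, 3).
    head_idx = 15 follows the SMPL joint ordering (an assumption here).
    """
    head_pos = joints[:, head_idx]                 # (T, 3) global head trajectory
    local = joints - head_pos[:, None, :]          # translate so the head is the origin
    # Rotate into the head frame: x_local = R_head^T @ (x_global - head_pos)
    local = np.einsum("tij,tkj->tki", head_rot.transpose(0, 2, 1), local)
    return local, head_pos

# Toy usage: 120 frames, 22 joints, identity head rotations.
T, J = 120, 22
joints = np.random.randn(T, J, 3)
head_rot = np.tile(np.eye(3), (T, 1, 1))
local_joints, head_traj = to_head_centric(joints, head_rot)
```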