EgoTwin: Dreaming Body and View in First Person

📅 2025-08-18
📈 Citations: 0 (influential: 0)
🤖 AI Summary
This work addresses the underexplored challenge of jointly generating egocentric videos and corresponding human body motions, which requires synchronously modeling visual content and the camera motion induced by the wearer's body. To this end, we propose a head-centric motion representation that anchors the body motion to the head joint, together with a cybernetics-inspired interaction mechanism that explicitly captures the bidirectional causal dependencies between the video and motion sequences. Built upon a diffusion transformer architecture, our framework incorporates cross-modal bidirectional attention to enable end-to-end joint generation of video and motion. For evaluation, we curate a large-scale real-world dataset of text-video-motion triplets and design novel viewpoint-alignment and motion-consistency metrics, on which our method demonstrates significant improvements over existing baselines. Our approach establishes a new paradigm for egocentric generation that co-models vision and embodied dynamics with high fidelity.
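To make the joint-generation architecture concrete, below is a minimal sketch of cross-modal bidirectional attention between a video token stream and a motion token stream inside a single transformer block. Everything here (the class name `JointAttentionBlock`, the plain concatenate-and-attend scheme, all dimensions) is an illustrative assumption, not the paper's implementation; EgoTwin additionally structures the attention to encode the causal video-motion interplay.

```python
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    """Illustrative DiT-style block: video and motion tokens attend to
    each other via full bidirectional attention over the joined sequence.
    (Hypothetical sketch, not the paper's exact design.)"""

    def __init__(self, dim: int = 512, num_heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, video_tokens, motion_tokens):
        # Concatenate the two modality streams into one sequence.
        x = torch.cat([video_tokens, motion_tokens], dim=1)
        h = self.norm1(x)
        # Bidirectional attention: every video token can attend to every
        # motion token and vice versa (EgoTwin further shapes this with a
        # causality-aware interaction mechanism; omitted here).
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        x = x + self.mlp(self.norm2(x))
        n_video = video_tokens.shape[1]
        return x[:, :n_video], x[:, n_video:]

# Example: 16 video tokens and 32 motion tokens, batch of 2.
block = JointAttentionBlock()
v = torch.randn(2, 16, 512)
m = torch.randn(2, 32, 512)
v_out, m_out = block(v, m)
print(v_out.shape, m_out.shape)  # (2, 16, 512) (2, 32, 512)
```

The point of the joined sequence is that each modality conditions the other at every denoising step, rather than one modality being generated first and the other derived from it.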

📝 Abstract
While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored, which requires modeling first-person view content along with camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce a novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.
Problem

Research questions and friction points this paper is trying to address.

Modeling first-person view content together with the camera motion induced by the wearer's body movements
Aligning the camera trajectory in the generated video with the head trajectory derived from the human motion (see the sketch after this list)
Ensuring causal interplay between the synthesized human motion and the visual dynamics across adjacent video frames
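As one concrete reading of the alignment challenge above, a viewpoint-alignment score can compare the camera trajectory recovered from the generated video against the head-joint trajectory implied by the generated motion. The sketch below scores alignment as mean translation error after a rigid, scale-free Kabsch/Umeyama-style alignment; the function names and this particular formulation are assumptions for illustration, not the paper's actual metric.

```python
import numpy as np

def align_rigid(src: np.ndarray, dst: np.ndarray):
    """Rigid (rotation + translation) alignment of src onto dst,
    both (T, 3) trajectories. Simplified Umeyama without scale."""
    mu_s, mu_d = src.mean(0), dst.mean(0)
    H = (src - mu_s).T @ (dst - mu_d)
    U, _, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(Vt.T @ U.T))])
    R = Vt.T @ D @ U.T
    t = mu_d - R @ mu_s
    return R, t

def viewpoint_alignment_error(cam_traj: np.ndarray, head_traj: np.ndarray) -> float:
    """Mean translation error between the video's camera trajectory and
    the motion's head-joint trajectory after rigid alignment.
    (Hypothetical stand-in for the paper's metric.)"""
    R, t = align_rigid(cam_traj, head_traj)
    aligned = cam_traj @ R.T + t
    return float(np.linalg.norm(aligned - head_traj, axis=1).mean())

# Example: a head trajectory and a noisy camera estimate of it.
T = 60
head = np.cumsum(np.random.randn(T, 3) * 0.01, axis=0)
cam = head + np.random.randn(T, 3) * 0.005
print(f"ATE: {viewpoint_alignment_error(cam, head):.4f} m")
```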
Innovation

Methods, ideas, or system contributions that make the work stand out.

Head-centric motion representation that anchors motion to the head joint (see the sketch after this list)
Cybernetics-inspired interaction mechanism for causal video-motion interplay
Diffusion transformer architecture for end-to-end joint generation
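To illustrate the first contribution, a head-centric representation can be obtained by re-expressing global joint positions in each frame's head coordinate frame, so the motion is anchored to the same pose that drives the egocentric camera. The sketch below is a hypothetical minimal version (joint index 15 assumes an SMPL-style skeleton); the paper's actual representation is more elaborate.

```python
import numpy as np

def to_head_centric(joints: np.ndarray, head_rot: np.ndarray,
                    head_idx: int = 15) -> np.ndarray:
    """Anchor a motion sequence to the head joint.

    joints:   (T, J, 3) global joint positions per frame.
    head_rot: (T, 3, 3) global head orientation per frame.
    Returns joint positions expressed in each frame's head frame, which
    ties the motion representation to the egocentric camera pose.
    (Hypothetical sketch; index 15 is SMPL's head joint.)
    """
    head_pos = joints[:, head_idx:head_idx + 1, :]   # (T, 1, 3)
    local = joints - head_pos                        # translate to head origin
    # Rotate world-frame offsets into the head frame: x_head = R^T x_world.
    Rt = np.transpose(head_rot, (0, 2, 1))
    return np.einsum('tij,tkj->tki', Rt, local)

# Example: 30 frames of a 22-joint skeleton with identity head orientation.
T, J = 30, 22
joints = np.random.randn(T, J, 3)
head_rot = np.tile(np.eye(3), (T, 1, 1))
local_joints = to_head_centric(joints, head_rot)
print(local_joints.shape)  # (30, 22, 3)
# The head joint itself becomes the origin in every frame:
assert np.allclose(local_joints[:, 15], 0.0)
```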
👥 Authors

Jingqiao Xiu, National University of Singapore
Fangzhou Hong, Nanyang Technological University (3D Computer Vision)
Yicong Li, National University of Singapore
Mengze Li, Hong Kong University of Science and Technology
Wentao Wang, Shanghai AI Laboratory
Sirui Han, The Hong Kong University of Science and Technology (Large Language Model, Interdisciplinary Artificial Intelligence)
Liang Pan, Shanghai AI Laboratory
Ziwei Liu, Associate Professor, Nanyang Technological University (Computer Vision, Machine Learning, Computer Graphics)