🤖 AI Summary
This work addresses the underexplored challenge of jointly generating egocentric videos and corresponding human body motions, which requires synchronously modeling visual content and the camera motion induced by the wearer's body. To this end, we propose EgoTwin, which introduces a head-anchored motion representation and a cybernetics-inspired interaction mechanism that explicitly captures the causal, bidirectional dependencies between video and motion sequences. Built upon a diffusion transformer architecture, the framework uses cross-modal bidirectional attention to enable end-to-end joint generation of video and motion. For evaluation, we curate a large-scale real-world dataset of synchronized text–video–motion triplets. Quantitative results, assessed via novel viewpoint-alignment and video–motion-consistency metrics, demonstrate significant improvements over existing baselines. Our approach establishes a new paradigm for egocentric generation: interpretable, high-fidelity joint modeling of vision and embodied dynamics.
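To make the joint-generation idea concrete, below is a minimal PyTorch sketch of a single transformer block that applies bidirectional attention over concatenated video and motion token streams. This is an illustrative sketch under stated assumptions, not the authors' implementation: the class and parameter names are hypothetical, and the paper's cybernetics-inspired causal masking between the two streams is omitted for brevity.

```python
# Illustrative sketch (assumed names/shapes): one DiT-style block where video
# and motion tokens attend to each other in a single bidirectional attention pass.
import torch
import torch.nn as nn

class JointAttentionBlock(nn.Module):
    def __init__(self, d_model: int = 256, n_heads: int = 8):
        super().__init__()
        self.norm_v = nn.LayerNorm(d_model)
        self.norm_m = nn.LayerNorm(d_model)
        # One attention over the concatenated streams lets video tokens attend
        # to motion tokens and vice versa (cross-modal, bidirectional).
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_mlp = nn.LayerNorm(d_model)
        self.mlp = nn.Sequential(
            nn.Linear(d_model, 4 * d_model), nn.GELU(), nn.Linear(4 * d_model, d_model)
        )

    def forward(self, video_tokens: torch.Tensor, motion_tokens: torch.Tensor):
        # video_tokens: (B, T_v, d), motion_tokens: (B, T_m, d)
        x_in = torch.cat([video_tokens, motion_tokens], dim=1)
        x_norm = torch.cat([self.norm_v(video_tokens), self.norm_m(motion_tokens)], dim=1)
        attn_out, _ = self.attn(x_norm, x_norm, x_norm, need_weights=False)
        x = x_in + attn_out
        x = x + self.mlp(self.norm_mlp(x))
        t_v = video_tokens.shape[1]
        return x[:, :t_v], x[:, t_v:]  # split back into the two streams

# Example: 16 video tokens and 32 motion tokens sharing one joint attention pass.
block = JointAttentionBlock()
v, m = torch.randn(2, 16, 256), torch.randn(2, 32, 256)
v_out, m_out = block(v, m)
```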
📝 Abstract
While exocentric video synthesis has achieved great progress, egocentric video generation remains largely underexplored, as it requires modeling first-person view content along with the camera motion patterns induced by the wearer's body movements. To bridge this gap, we introduce the novel task of joint egocentric video and human motion generation, characterized by two key challenges: 1) Viewpoint Alignment: the camera trajectory in the generated video must accurately align with the head trajectory derived from human motion; 2) Causal Interplay: the synthesized human motion must causally align with the observed visual dynamics across adjacent video frames. To address these challenges, we propose EgoTwin, a joint video-motion generation framework built on the diffusion transformer architecture. Specifically, EgoTwin introduces a head-centric motion representation that anchors the human motion to the head joint and incorporates a cybernetics-inspired interaction mechanism that explicitly captures the causal interplay between video and motion within attention operations. For comprehensive evaluation, we curate a large-scale real-world dataset of synchronized text-video-motion triplets and design novel metrics to assess video-motion consistency. Extensive experiments demonstrate the effectiveness of the EgoTwin framework.
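As a rough illustration of what a head-centric (head-anchored) motion representation could look like, the sketch below re-expresses global joint positions in the per-frame coordinate system of the head joint, so that the head trajectory is exposed as an explicit anchor for the egocentric camera. The function name, array shapes, and the SMPL-style head joint index are assumptions for illustration, not the paper's actual data format.

```python
# Hypothetical sketch of a head-centric motion representation: per frame, joint
# positions are expressed relative to the head joint's pose, making the head
# trajectory (and thus the egocentric camera path) explicit.
import numpy as np

def to_head_centric(joints: np.ndarray, head_rot: np.ndarray, head_idx: int = 15):
    """joints: (T, J, 3) global joint positions; head_rot: (T, 3, 3) head rotations.

    Returns head-frame joint positions (T, J, 3) and the global head trajectory (T, 3).
    head_idx = 15 follows the SMPL joint ordering (an assumption here).
    """
    head_pos = joints[:, head_idx]                 # (T, 3) global head trajectory
    local = joints - head_pos[:, None, :]          # translate so the head is the origin
    # Rotate into the head frame: x_local = R_head^T @ (x_global - head_pos)
    local = np.einsum("tij,tkj->tki", head_rot.transpose(0, 2, 1), local)
    return local, head_pos

# Toy usage: 120 frames, 22 joints, identity head rotations.
T, J = 120, 22
joints = np.random.randn(T, J, 3)
head_rot = np.tile(np.eye(3), (T, 1, 1))
local_joints, head_traj = to_head_centric(joints, head_rot)
```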