🤖 AI Summary
This work addresses three challenges in first-person (egocentric) video generation: rapid viewpoint changes, frequent self-occlusion, and keeping people and scene state consistent with the specified controls. The proposed E³C framework is a controllable video diffusion model whose conditioning separates persistent scene structure from human-driven dynamics. For the scene, it builds a semi-dense point-cloud 3D memory from context frames, attaches video-VAE appearance features to each point, and renders the memory into the target viewpoints. For people, observed humans are controlled by skeleton renderings (exo control), while the camera wearer is specified by 3D body joints and 6DoF wrist motion (ego control); an ego motion encoder emits persistent cross-attention tokens so that ego control holds even when the wearer's body is out of view. On the Nymeria dataset, E³C improves over strong baselines in visual fidelity, camera-motion accuracy, object consistency, and ego and exo human control, and it also supports scene editing.
📝 Abstract
Controllable and physically grounded egocentric video generation is essential for embodied agents to reason about how their own and others' actions manifest and change the world. Compared to generic video synthesis, egocentric generation is especially challenging: the camera is tightly coupled to the actor, leading to rapid viewpoint changes and frequent self-occlusions; the underlying actions are subtle, articulated, and often only partially visible; and both the people and the scene state must evolve consistently with the specified controls. We present E$^3$C, a controllable video diffusion framework for egocentric generation that builds structured and compact conditions disentangling persistent scene structure from human-driven dynamics. From context frames, E$^3$C constructs a semi-dense point-cloud-based 3D memory and augments each point with appearance descriptors from video-VAE features. Rendering this memory into target viewpoints produces conditioning aligned with the target frames. Human dynamics are modeled separately: the observed people in the scene are controlled by skeleton renderings (exo human control), while the camera wearer is specified by their 3D body joints and 6DoF wrist motion (ego human control). To preserve ego human control when the wearer's body parts are invisible, we introduce an ego motion encoder that produces persistent cross-attention tokens. Experiments on Nymeria show that E$^3$C improves visual fidelity, camera-motion accuracy, object consistency, and ego and exo human control over strong baselines, while also enabling intuitive scene editing.
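The abstract does not say how the point memory is rendered, so the following is only a minimal sketch of the general idea of projecting a feature-carrying point cloud into a target view. It assumes a pinhole camera, a world-to-camera pose given as x_cam = R x_world + t, and a nearest-point z-buffer. The function name `render_point_memory` and all of its parameters are hypothetical and are not taken from the paper.

```python
import math
from typing import List, Optional, Sequence, Tuple

Feature = List[float]


def render_point_memory(
    points: Sequence[Sequence[float]],   # world-space (x, y, z) per point
    feats: Sequence[Feature],            # one appearance descriptor per point
    R: Sequence[Sequence[float]],        # 3x3 world-to-camera rotation (assumed)
    t: Sequence[float],                  # world-to-camera translation (assumed)
    fx: float, fy: float, cx: float, cy: float,  # pinhole intrinsics (assumed)
    width: int, height: int,
) -> Tuple[List[List[Optional[Feature]]], List[List[float]]]:
    """Splat per-point features into a target view, keeping the nearest point per pixel."""
    depth = [[math.inf] * width for _ in range(height)]
    image: List[List[Optional[Feature]]] = [[None] * width for _ in range(height)]
    for p, f in zip(points, feats):
        # Transform the point from world coordinates into the target camera frame.
        xc = [sum(R[i][j] * p[j] for j in range(3)) + t[i] for i in range(3)]
        z = xc[2]
        if z <= 1e-6:
            continue  # behind (or on) the camera plane: not visible
        # Perspective projection to the nearest pixel centre.
        u = math.floor(fx * xc[0] / z + cx + 0.5)
        v = math.floor(fy * xc[1] / z + cy + 0.5)
        # Z-buffer test: keep only the closest point that lands on each pixel.
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            image[v][u] = f
    return image, depth


if __name__ == "__main__":
    identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    pts = [(0.0, 0.0, 2.0), (0.0, 0.0, 1.0), (-1.0, -1.0, 2.0)]
    descs = [[1.0], [2.0], [3.0]]
    img, dep = render_point_memory(pts, descs, identity, (0.0, 0.0, 0.0),
                                   2.0, 2.0, 2.0, 2.0, 4, 4)
    covered = sum(cell is not None for row in img for cell in row)
    print("pixels covered:", covered)
```

Pixels that no point reaches stay `None`, which reflects the semi-dense nature of such a memory: the rendered conditioning is aligned with the target frame but sparse, and a generative model would be left to fill the gaps.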