EgoControl: Controllable Egocentric Video Generation via 3D Full-Body Poses

📅 2025-11-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of fine-grained action control in egocentric video generation for embodied AI. The authors propose a controllable first-person video generation method conditioned on 3D full-body pose sequences. The core innovations are (1) a novel 3D pose representation that jointly encodes global camera motion and local joint kinematics, and (2) an explicit pose-conditioned control network embedded in the denoising process of a diffusion model. By jointly constraining body motion and viewpoint dynamics during temporal modeling, the approach significantly improves pose consistency and visual realism in generated videos. Experiments demonstrate that the method produces high-fidelity, temporally coherent egocentric videos across diverse scenarios, outperforming prior work in both action controllability and pose-prediction accuracy. The framework provides an interpretable, intervention-friendly visual generation foundation for action simulation, forecasting, and planning in embodied agents.
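To make the first innovation concrete, here is a minimal sketch of a per-frame pose representation that jointly encodes global camera motion and local joint kinematics. The paper's exact parameterization is not given in this summary, so the joint count (22, as in SMPL-style body models), the use of axis-angle rotations, and frame-to-frame camera deltas are all assumptions for illustration.

```python
import numpy as np

NUM_JOINTS = 22  # assumed SMPL-style body-joint count


def encode_pose_frame(cam_rot_delta, cam_trans_delta, joint_rotations):
    """Flatten global camera motion and local joint rotations into one vector.

    cam_rot_delta:   (3,) axis-angle head-camera rotation vs. the previous frame
    cam_trans_delta: (3,) camera translation vs. the previous frame
    joint_rotations: (NUM_JOINTS, 3) axis-angle rotation per body joint
    """
    cam_rot_delta = np.asarray(cam_rot_delta, dtype=np.float32)
    cam_trans_delta = np.asarray(cam_trans_delta, dtype=np.float32)
    joint_rotations = np.asarray(joint_rotations, dtype=np.float32)
    assert cam_rot_delta.shape == (3,) and cam_trans_delta.shape == (3,)
    assert joint_rotations.shape == (NUM_JOINTS, 3)
    # Global viewpoint dynamics first, then the articulated body state,
    # concatenated into a single conditioning vector per frame.
    return np.concatenate([cam_rot_delta, cam_trans_delta, joint_rotations.ravel()])


frame = encode_pose_frame(np.zeros(3), np.zeros(3), np.zeros((NUM_JOINTS, 3)))
# Per-frame conditioning dimension: 3 + 3 + 22 * 3 = 72
```

A sequence of such vectors, one per target frame, would then serve as the pose-control signal for the video model.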

📝 Abstract
Egocentric video generation with fine-grained control through body motion is a key requirement for embodied AI agents that can simulate, predict, and plan actions. In this work, we propose EgoControl, a pose-controllable video diffusion model trained on egocentric data. We train a video prediction model to condition future frame generation on explicit 3D body pose sequences. To achieve precise motion control, we introduce a novel pose representation that captures both global camera dynamics and articulated body movements, and integrate it through a dedicated control mechanism within the diffusion process. Given a short sequence of observed frames and a sequence of target poses, EgoControl generates temporally coherent and visually realistic future frames that align with the provided pose control. Experimental results demonstrate that EgoControl produces high-quality, pose-consistent egocentric videos, paving the way toward controllable embodied video simulation and understanding.
Problem

Research questions and friction points this paper is trying to address.

Generating controllable egocentric videos using 3D body poses
Achieving fine-grained motion control in video generation
Aligning generated frames with precise pose sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses video diffusion model for egocentric generation
Introduces novel 3D pose representation for motion
Integrates pose control mechanism in diffusion process
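The third bullet, a pose-control mechanism inside the diffusion process, can be sketched as a ControlNet-style residual branch that injects projected pose features into the denoiser's intermediate activations. The module and parameter names below (`PoseControlBlock`, `pose_dim`, `hidden`) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn


class PoseControlBlock(nn.Module):
    """Adds projected 3D-pose features to denoiser features as a residual."""

    def __init__(self, pose_dim: int, hidden: int):
        super().__init__()
        self.pose_proj = nn.Sequential(
            nn.Linear(pose_dim, hidden),
            nn.SiLU(),
            nn.Linear(hidden, hidden),
        )
        # Zero-initialize the output layer so conditioning starts as a no-op
        # and its influence is learned gradually (a common ControlNet-style trick).
        nn.init.zeros_(self.pose_proj[-1].weight)
        nn.init.zeros_(self.pose_proj[-1].bias)

    def forward(self, denoiser_feats: torch.Tensor, pose_seq: torch.Tensor) -> torch.Tensor:
        # denoiser_feats: (B, T, hidden) intermediate features of the video denoiser
        # pose_seq:       (B, T, pose_dim) target 3D pose sequence for future frames
        return denoiser_feats + self.pose_proj(pose_seq)


block = PoseControlBlock(pose_dim=72, hidden=128)
feats = torch.randn(2, 16, 128)
poses = torch.randn(2, 16, 72)
out = block(feats, poses)
# At initialization the zero-initialized projection leaves the features unchanged.
```

The zero-initialized residual lets the pretrained video denoiser keep its behavior at the start of fine-tuning, while training gradually strengthens the pose signal's control over generation.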