Controllable Egocentric Video Generation via Occlusion-Aware Sparse 3D Hand Joints

📅 2026-03-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods for egocentric video generation struggle to achieve fine-grained and 3D-consistent hand motion control, often suffering from occlusion sensitivity and limited generalization across diverse embodied agents. This work proposes a sparse 3D hand joint-based control framework that leverages occlusion-aware feature extraction, 3D-weighted motion propagation, and latent-space geometric embedding to generate high-fidelity, naturally interactive videos from a single reference frame. Notably, the approach achieves high-quality generalization to heterogeneous embodiments—such as robotic hands—for the first time, supported by a newly constructed million-scale automatically annotated dataset. Experimental results demonstrate that the proposed method significantly outperforms current state-of-the-art techniques in both generation quality and cross-embodiment transfer performance.

📝 Abstract
Motion-controllable video generation is crucial for egocentric applications in virtual reality and embodied AI. However, existing methods often struggle to achieve 3D-consistent, fine-grained hand articulation. By relying on 2D trajectories or implicit poses, they collapse 3D geometry into spatially ambiguous signals or over-rely on human-centric priors. Under severe egocentric occlusions, this causes motion inconsistencies and hallucinated artifacts, and it prevents cross-embodiment generalization to robotic hands. To address these limitations, we propose a novel framework that generates egocentric videos from a single reference frame, leveraging sparse 3D hand joints as embodiment-agnostic control signals with clear semantic and geometric structure. We introduce an efficient control module that resolves occlusion ambiguities while fully preserving 3D information. Specifically, it extracts occlusion-aware features from the source reference frame by penalizing unreliable visual signals from hidden joints, and employs a 3D-based weighting mechanism to robustly handle dynamically occluded target joints during motion propagation. Concurrently, the module directly injects 3D geometric embeddings into the latent space to strictly enforce structural consistency. To facilitate robust training and evaluation, we develop an automated annotation pipeline that yields over one million high-quality egocentric video clips paired with precise hand trajectories. Additionally, we register humanoid kinematic and camera data to construct a cross-embodiment benchmark. Extensive experiments demonstrate that our approach significantly outperforms state-of-the-art baselines, generating high-fidelity egocentric videos with realistic interactions and exhibiting exceptional cross-embodiment generalization to robotic hands.
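The abstract's control module combines two ideas: masking unreliable features from hidden source joints, and propagating reference-frame features to target joints with weights derived from 3D distances. The paper does not publish the exact formulation, so the following is only a minimal illustrative sketch of that general pattern, assuming per-joint visibility flags and a Gaussian 3D-distance weighting (function names and the `sigma` parameter are hypothetical, not from the paper):

```python
import numpy as np

def propagate_features(src_feats, src_joints, tgt_joints, src_visible, sigma=0.05):
    """Toy sketch of occlusion-aware, 3D-weighted feature propagation.

    src_feats:   (J, D) per-joint features from the reference frame
    src_joints:  (J, 3) 3D joint positions in the reference frame
    tgt_joints:  (J, 3) 3D joint positions in the target frame
    src_visible: (J,) boolean visibility of each source joint
    Assumes at least one source joint is visible.
    """
    # Penalize unreliable visual signals: hidden source joints get zero weight.
    vis = np.asarray(src_visible, dtype=bool)

    # 3D-based weighting: each target joint attends to visible source joints,
    # with weights decaying in squared 3D distance (Gaussian kernel, assumed form).
    d2 = ((tgt_joints[:, None, :] - src_joints[None, :, :]) ** 2).sum(-1)  # (J, J)
    logits = -d2 / (2.0 * sigma**2)
    logits = np.where(vis[None, :], logits, -np.inf)   # mask hidden source joints

    # Row-wise softmax -> stochastic propagation weights.
    w = np.exp(logits - logits.max(axis=1, keepdims=True))
    w = w / w.sum(axis=1, keepdims=True)               # (J, J)

    return w @ src_feats                               # (J, D) propagated features
```

In this sketch, a dynamically occluded target joint still receives a well-defined feature, since its weights draw only on geometrically nearby, visible source joints; the paper's actual module additionally injects 3D geometric embeddings into the latent space, which is omitted here.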
Problem

Research questions and friction points this paper is trying to address.

egocentric video generation
3D hand articulation
occlusion
motion control
cross-embodiment generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

occlusion-aware
sparse 3D hand joints
embodiment-agnostic control
egocentric video generation
3D geometric consistency