EgoForge: Goal-Directed Egocentric World Simulator

📅 2026-03-20
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
Existing methods struggle to synthesize egocentric dynamic videos due to rapid viewpoint shifts, frequent hand-object interactions, and goal-directed human intent. This work proposes EgoForge, a goal-directed world simulator that generates temporally coherent and semantically consistent first-person videos from a single egocentric image, a high-level textual instruction, and an optional exocentric view. Its VideoDiffusionNFT refinement jointly optimizes task completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling, guided by trajectory-level rewards, without requiring explicit camera trajectories or dense multi-view supervision. Experiments demonstrate that EgoForge significantly outperforms strong baselines in semantic alignment, geometric stability, and motion fidelity, while remaining robust in real-world smart-glasses scenarios.

๐Ÿ“ Abstract
Generative world models have shown promise for simulating dynamic environments, yet egocentric video remains challenging due to rapid viewpoint changes, frequent hand-object interactions, and goal-directed procedures whose evolution depends on latent human intent. Existing approaches either focus on hand-centric instructional synthesis with limited scene evolution, perform static view translation without modeling action dynamics, or rely on dense supervision such as camera trajectories, long video prefixes, or synchronized multi-camera capture. In this work, we introduce EgoForge, an egocentric goal-directed world simulator that generates coherent, first-person video rollouts from minimal static inputs: a single egocentric image, a high-level instruction, and an optional auxiliary exocentric view. To improve intent alignment and temporal consistency, we propose VideoDiffusionNFT, a trajectory-level reward-guided refinement that optimizes goal completion, temporal causality, scene consistency, and perceptual fidelity during diffusion sampling. Extensive experiments show that EgoForge achieves consistent gains in semantic alignment, geometric stability, and motion fidelity over strong baselines, and performs robustly in real-world smart-glasses experiments.
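The abstract describes refining diffusion samples with a trajectory-level reward combining goal completion, temporal causality, scene consistency, and perceptual fidelity. The paper does not publish its sampling algorithm here, but one simple way such reward guidance can work is best-of-N selection: draw several candidate rollouts and keep the one with the highest combined reward. The sketch below is a minimal illustration of that idea only; the function names, reward fields, and weights are hypothetical and are not taken from the paper.

```python
def combined_reward(rollout, weights=(0.4, 0.2, 0.2, 0.2)):
    """Weighted sum of per-trajectory reward terms.

    The four terms mirror the objectives named in the abstract
    (goal completion, temporal causality, scene consistency,
    perceptual fidelity); the weights are illustrative.
    """
    terms = (rollout["goal"], rollout["causality"],
             rollout["consistency"], rollout["fidelity"])
    return sum(w * t for w, t in zip(weights, terms))

def best_of_n_sampling(sample_rollout, n=4):
    """Draw n candidate rollouts from a sampler callable and
    return the one with the highest trajectory-level reward,
    a simple form of reward-guided selection at sampling time."""
    candidates = [sample_rollout() for _ in range(n)]
    return max(candidates, key=combined_reward)
```

In a real system, `sample_rollout` would run the video diffusion sampler end to end and the reward terms would come from learned scorers; the selection (or a gradient-based variant of it) is what steers generation toward the instruction without dense supervision.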
Problem

Research questions and friction points this paper is trying to address.

egocentric video
goal-directed simulation
world modeling
intent-driven dynamics
first-person vision
Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric video generation
goal-directed simulation
diffusion model
reward-guided refinement
minimal supervision