EgoSim: Egocentric World Simulator for Embodied Interaction Generation

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing first-person simulators struggle to maintain spatially consistent world states across multi-stage interactions because they lack an explicit, updatable 3D scene representation, often resulting in structural drift and inconsistent interactions under viewpoint changes. To address these issues, the authors propose a closed-loop first-person world simulator built around an updatable 3D world-state representation, integrating geometry-aware, action-conditioned observation synthesis with an interaction-aware state-update mechanism. They further introduce EgoCap, a low-cost capture system that uses uncalibrated smartphones, and a scalable pipeline for extracting training data from in-the-wild monocular videos. The proposed approach significantly outperforms existing methods in visual fidelity, spatial consistency, and generalization to complex dexterous interactions, while enabling cross-embodiment transfer to robotic manipulation tasks.
📝 Abstract
We introduce EgoSim, a closed-loop egocentric world simulator that generates spatially consistent interaction videos and persistently updates the underlying 3D scene state for continuous simulation. Existing egocentric simulators either lack explicit 3D grounding, causing structural drift under viewpoint changes, or treat the scene as static, failing to update world states across multi-stage interactions. EgoSim addresses both limitations by modeling 3D scenes as updatable world states. We generate embodiment interactions via a Geometry-action-aware Observation Simulation model, with spatial consistency enforced by an Interaction-aware State Updating module. To overcome the critical data bottleneck posed by the difficulty of acquiring densely aligned scene-interaction training pairs, we design a scalable pipeline that extracts static point clouds, camera trajectories, and embodiment actions from large-scale in-the-wild monocular egocentric videos. We further introduce EgoCap, a capture system that enables low-cost real-world data collection with uncalibrated smartphones. Extensive experiments demonstrate that EgoSim significantly outperforms existing methods in visual quality, spatial consistency, and generalization to complex scenes and in-the-wild dexterous interactions, while supporting cross-embodiment transfer to robotic manipulation. Code and datasets will be released soon. The project page is at egosimulator.github.io.
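The abstract describes a closed loop: synthesize an observation conditioned on the 3D world state and an action, then update the world state so later observations remain spatially consistent. A minimal, hypothetical sketch of that loop is below; all class and function names are illustrative placeholders, not from the paper's code, and the actual system uses a generative video model rather than the toy synthesis shown here.

```python
# Hypothetical sketch of the closed-loop simulation pattern described
# in the abstract. The world state is an explicit, updatable 3D scene
# (here, a bare point cloud); each step synthesizes an observation
# conditioned on geometry, camera pose, and action, then updates the
# state so the loop stays spatially consistent.

class WorldState:
    """Explicit, updatable 3D scene state (e.g., a point cloud)."""
    def __init__(self, points):
        self.points = list(points)

def synthesize_observation(state, camera_pose, action):
    # Stand-in for geometry- and action-conditioned observation
    # synthesis: returns a summary of the conditioning inputs.
    return {"n_points": len(state.points),
            "pose": camera_pose,
            "action": action}

def update_state(state, action):
    # Stand-in for interaction-aware state updating: apply the
    # action's effect (a rigid translation here) to the scene.
    dx, dy, dz = action["delta"]
    state.points = [(x + dx, y + dy, z + dz) for (x, y, z) in state.points]
    return state

def simulate(state, camera_poses, actions):
    observations = []
    for pose, action in zip(camera_poses, actions):
        observations.append(synthesize_observation(state, pose, action))
        state = update_state(state, action)  # close the loop
    return observations, state
```

The key design point the paper argues for is that `update_state` persists across steps: because each observation is rendered from the *current* state rather than a static scene, multi-stage interactions do not drift.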
Problem

Research questions and friction points this paper is trying to address.

egocentric simulation · 3D scene state · spatial consistency · embodied interaction · world state updating
Innovation

Methods, ideas, or system contributions that make the work stand out.

egocentric simulation · 3D scene state updating · embodied interaction generation · geometry-action-aware modeling · in-the-wild data extraction
Authors
Jinkun Hao (Shanghai Jiao Tong University)
Mingda Jia (Shanghai AI Laboratory)
Ruiyan Wang (Shanghai Jiao Tong University)
Xihui Liu (University of Hong Kong)
Ran Yi (Shanghai Jiao Tong University)
Lizhuang Ma (Shanghai Jiao Tong University)
Jiangmiao Pang (Shanghai AI Laboratory)
Xudong Xu (Shanghai AI Laboratory)