UniEgoMotion: A Unified Model for Egocentric Motion Reconstruction, Forecasting, and Generation

📅 2025-08-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing methods rely on structured 3D scene representations, which makes them ill-suited for egocentric motion synthesis under constrained fields of view, frequent occlusions, and dynamic camera motion. This paper introduces UniEgoMotion, a unified framework for monocular egocentric motion reconstruction, forecasting, and generation that requires no explicit 3D scene modeling. Its core contributions are: (1) a head-centric motion representation tailored to egocentric devices; (2) a conditional motion diffusion model that couples an egocentric visual encoder with a motion decoder to extract image-based scene context; and (3) EE4D-Motion, a large-scale egocentric motion dataset derived from EgoExo4D with pseudo-ground-truth 3D annotations. Experiments show state-of-the-art egocentric motion reconstruction and, for the first time, plausible future motion generated from a single egocentric image, opening new possibilities for AR/VR and human–computer interaction.
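The summary above describes a conditional diffusion model that couples an egocentric visual encoder with a motion decoder. Below is a minimal sketch of what such a denoiser could look like, assuming a transformer backbone and pre-extracted image features; the class and parameter names (`EgoMotionDenoiser`, `motion_dim`, the 768-dimensional feature size) are hypothetical and not taken from the paper.

```python
import torch
import torch.nn as nn

class EgoMotionDenoiser(nn.Module):
    """Hypothetical sketch of a conditional motion diffusion denoiser:
    pre-extracted egocentric image features supply scene context, and a
    transformer predicts the clean head-centric motion from a noisy one."""
    def __init__(self, motion_dim=135, d_model=512, n_layers=6):
        super().__init__()
        self.motion_proj = nn.Linear(motion_dim, d_model)
        self.time_embed = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                        nn.Linear(d_model, d_model))
        self.img_proj = nn.Linear(768, d_model)      # assumed ViT-style image features
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=n_layers)
        self.head = nn.Linear(d_model, motion_dim)

    def forward(self, noisy_motion, t, img_feats):
        # noisy_motion: (B, T, motion_dim), t: (B,), img_feats: (B, 768)
        x = self.motion_proj(noisy_motion)
        cond = self.time_embed(t[:, None].float()) + self.img_proj(img_feats)
        x = torch.cat([cond[:, None, :], x], dim=1)  # prepend a conditioning token
        x = self.backbone(x)
        return self.head(x[:, 1:])                   # predicted clean motion
```

In a standard diffusion setup, a module like this would be trained to predict the clean motion (or the added noise) at each diffusion step, with the same weights shared across reconstruction, forecasting, and generation.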

📝 Abstract
Egocentric human motion generation and forecasting with scene context is crucial for enhancing AR/VR experiences, improving human-robot interaction, advancing assistive technologies, and enabling adaptive healthcare solutions by accurately predicting and simulating movement from a first-person perspective. However, existing methods primarily focus on third-person motion synthesis with structured 3D scene contexts, limiting their effectiveness in real-world egocentric settings where limited field of view, frequent occlusions, and dynamic cameras hinder scene perception. To bridge this gap, we introduce Egocentric Motion Generation and Egocentric Motion Forecasting, two novel tasks that utilize first-person images for scene-aware motion synthesis without relying on an explicit 3D scene. We propose UniEgoMotion, a unified conditional motion diffusion model with a novel head-centric motion representation tailored for egocentric devices. UniEgoMotion's simple yet effective design supports egocentric motion reconstruction, forecasting, and generation from first-person visual inputs in a unified framework. Unlike previous works that overlook scene semantics, our model effectively extracts image-based scene context to infer plausible 3D motion. To facilitate training, we introduce EE4D-Motion, a large-scale dataset derived from EgoExo4D, augmented with pseudo-ground-truth 3D motion annotations. UniEgoMotion achieves state-of-the-art performance in egocentric motion reconstruction and is the first to generate motion from a single egocentric image. Extensive evaluations demonstrate the effectiveness of our unified framework, setting a new benchmark for egocentric motion modeling and unlocking new possibilities for egocentric applications.
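The head-centric motion representation mentioned in the abstract can be illustrated as a simple coordinate-frame transform: re-expressing world-space body joints relative to the head-mounted device so the representation is invariant to the global camera trajectory. This is only a sketch under assumed conventions (rotation matrices, `(T, J, 3)` joint arrays); the paper's actual parameterisation may include additional terms such as joint rotations or velocities.

```python
import numpy as np

def to_head_centric(joints_world, head_rot_world, head_pos_world):
    """Hypothetical head-centric re-parameterisation.

    joints_world:   (T, J, 3) joint positions in world coordinates
    head_rot_world: (T, 3, 3) head orientation (world <- head rotation)
    head_pos_world: (T, 3)    head position in world coordinates
    Returns joints expressed in each frame's head coordinate frame.
    """
    rel = joints_world - head_pos_world[:, None, :]            # translate to head origin
    # rotate each frame's joints into the head frame: R^T @ p
    return np.einsum('tij,tkj->tki',
                     np.transpose(head_rot_world, (0, 2, 1)), rel)
```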
Problem

Research questions and friction points this paper is trying to address.

Scene-aware first-person motion prediction for AR/VR
Limited field of view, frequent occlusions, and dynamic cameras in egocentric settings
Scene-aware motion synthesis without explicit 3D scene data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Unified conditional motion diffusion model (see the conditioning sketch after this list)
Head-centric motion representation
Image-based scene context extraction
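How a single model can serve reconstruction, forecasting, and generation can be sketched as a conditioning scheme: which signals are exposed to the denoiser determines the task. The function below is a hypothetical illustration only; the signal layout, the `n_past` split, and the use of head-trajectory channels are assumptions, not the paper's exact interface.

```python
import torch

def build_condition(motion, head_traj, task, n_past=30):
    """Hypothetical task conditioning for one unified diffusion model.
      - 'reconstruct': head/device trajectory is observed for every frame
      - 'forecast':    only the first n_past frames of full motion are observed
      - 'generate':    no motion is observed; image features alone condition the model

    motion:    (B, T, D) full-body motion in head-centric representation
    head_traj: (B, T, H) per-frame head/device pose channels
    """
    B, T, D = motion.shape
    obs = torch.zeros(B, T, D + head_traj.shape[-1])
    mask = torch.zeros(B, T, 1)                    # 1 = observed, 0 = to synthesise
    if task == 'reconstruct':
        obs[..., D:] = head_traj                   # trajectory known, body unknown
        mask[:] = 1.0
    elif task == 'forecast':
        obs[:, :n_past, :D] = motion[:, :n_past]   # past motion known, future masked
        mask[:, :n_past] = 1.0
    # 'generate': obs stays empty; the egocentric image drives synthesis
    return torch.cat([obs, mask], dim=-1)
```

A shared conditioning interface of this kind is one way a "simple yet effective" unified design could avoid separate task-specific models.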