EgoAnimate: Generating Human Animations from Egocentric top-down Views

📅 2025-07-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of generating drivable, animatable human avatars from a single egocentric top-down image, where severe occlusion and distorted body proportions impede accurate reconstruction. We propose an end-to-end generative prior-based method that integrates Stable Diffusion with ControlNet to establish a controllable pipeline mapping top-down views to frontal-body appearances, coupled with an image-to-motion transfer model. Crucially, the framework operates solely on a single-view input during both training and inference, requiring no multi-view annotations or auxiliary sensors, thereby improving generalizability and ease of deployment. Experiments demonstrate effective mitigation of the geometric distortion intrinsic to top-down views, enabling high-fidelity, animatable human synthesis with modest hardware requirements. The approach advances egocentric vision-driven avatar generation, offering a novel paradigm for lightweight digital telepresence applications.
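The top-down-to-frontal stage the summary describes can be sketched with the Hugging Face diffusers library. This is a minimal illustration only: the checkpoints named below and the OpenPose-style pose conditioning are assumptions, since the paper's exact ControlNet conditioning signal and weights are not specified here.

```python
# Sketch of the top-down -> frontal generation stage described above.
# The pose-conditioned ControlNet and checkpoint choices are assumptions;
# the paper's own conditioning and weights may differ.
import torch
from diffusers import ControlNetModel, StableDiffusionControlNetPipeline
from diffusers.utils import load_image

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-openpose", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    controlnet=controlnet,
    torch_dtype=torch.float16,
).to("cuda")

# Conditioning image derived from the occluded top-down egocentric view
# (hypothetical file; see the conditioning sketch after the abstract).
condition = load_image("topdown_pose_map.png")

frontal = pipe(
    prompt="full-body frontal photo of the same person, standing",
    image=condition,
    num_inference_steps=30,
    guidance_scale=7.5,
).images[0]
frontal.save("frontal_view.png")
```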

📝 Abstract
An ideal digital telepresence experience requires accurate replication of a person's body, clothing, and movements. To capture and transfer these movements into virtual reality, the egocentric (first-person) perspective can be adopted, which enables the use of a portable and cost-effective device without front-view cameras. However, this viewpoint introduces challenges such as occlusions and distorted body proportions. There are few works reconstructing human appearance from egocentric views, and none use a generative prior-based approach. Some methods create avatars from a single egocentric image during inference, but still rely on multi-view datasets during training. To our knowledge, this is the first study using a generative backbone to reconstruct animatable avatars from egocentric inputs. Based on Stable Diffusion, our method reduces training burden and improves generalizability. Inspired by methods such as SiTH and MagicMan, which perform 360-degree reconstruction from a frontal image, we introduce a pipeline that generates realistic frontal views from occluded top-down images using ControlNet and a Stable Diffusion backbone. Our goal is to convert a single top-down egocentric image into a realistic frontal representation and feed it into an image-to-motion model. This enables generation of avatar motions from minimal input, paving the way for more accessible and generalizable telepresence systems.
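The frontal-view generation the abstract describes needs a conditioning image for ControlNet. As an illustration of one plausible preprocessing step, the sketch below uses the controlnet_aux OpenPose wrapper to extract a body-pose map from the egocentric frame. The off-the-shelf detector is an assumption: generic pose estimators may struggle on strongly distorted top-down views, which is precisely the gap the paper's trained pipeline targets.

```python
# Sketch: deriving a ControlNet conditioning map from the top-down frame.
# Using controlnet_aux's OpenPose wrapper is an assumption for illustration;
# the paper's own conditioning signal may differ.
from controlnet_aux import OpenposeDetector
from diffusers.utils import load_image

detector = OpenposeDetector.from_pretrained("lllyasviel/Annotators")

topdown = load_image("egocentric_topdown.png")  # hypothetical input frame
pose_map = detector(topdown)                    # rendered 2D keypoint skeleton
pose_map.save("topdown_pose_map.png")           # consumed by the ControlNet stage
```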
Problem

Research questions and friction points this paper is trying to address.

Reconstructing animatable avatars from egocentric top-down views
Overcoming occlusions and distortions in first-person perspectives
Generating realistic frontal animations from minimal input images
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generative backbone for egocentric avatar reconstruction
ControlNet with Stable Diffusion for frontal view generation
Single top-down image to motion conversion (see the sketch after this list)
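To make the hand-off from the generated frontal image to the image-to-motion stage concrete, here is a hypothetical glue sketch showing only the data flow. The animate function is a placeholder, not an API from the paper or any specific library; any pose-driven image-to-motion model could be plugged in.

```python
# Hypothetical glue for the image-to-motion stage; `animate` is a stand-in
# for whichever image-to-motion model is used and is NOT an API from the paper.
from pathlib import Path
from PIL import Image

def animate(frontal: Image.Image, driving_poses: list[Image.Image]) -> list[Image.Image]:
    """Placeholder: re-render the frontal appearance under each driving pose."""
    raise NotImplementedError("plug in an image-to-motion model here")

frontal = Image.open("frontal_view.png")  # output of the ControlNet stage
poses = [Image.open(p) for p in sorted(Path("driving_poses").glob("*.png"))]
frames = animate(frontal, poses)          # one output frame per driving pose
```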
G. Kutay Türkoglu
Sony Semiconductor Solutions Europe, Germany
Julian Tanke
Research Scientist, Sony AI
Computer Vision
Iheb Belgacem
Sony Semiconductor Solutions Europe, Germany
Lev Markhasin
Unknown affiliation