🤖 AI Summary
This work addresses the challenge of 3D human reconstruction from in-the-wild monocular videos under severe occlusion, a setting where existing methods often fail due to their reliance on unoccluded views and canonical poses. The proposed method, AHOY, introduces a "hallucination-as-supervision" mechanism that leverages an identity-finetuned video diffusion model to generate dense supervisory signals for occluded regions. It employs a two-stage reconstruction pipeline: first recovering complete geometry in a canonical space, then mapping it to a pose-conditioned 3D Gaussian representation. By decoupling the mapping pose from the driving (LBS) pose and adopting a head/body split supervision strategy, AHOY preserves facial identity and absorbs multi-view inconsistencies in the generated supervision. Experiments demonstrate state-of-the-art performance on occluded real-world YouTube videos and on multi-view capture data, and the resulting avatars support high-quality animation and compositing into 3DGS scenes captured with cell-phone video.
📝 Abstract
We present AHOY, a method for reconstructing complete, animatable 3D Gaussian avatars from in-the-wild monocular video despite heavy occlusion. Existing methods assume unoccluded input (a fully visible subject, often in a canonical pose), excluding the vast majority of real-world footage, where people are routinely occluded by furniture, objects, or other people. Reconstructing from such footage poses fundamental challenges: large body regions may never be observed, and multi-view supervision per pose is unavailable. We address these challenges with four contributions: (i) a hallucination-as-supervision pipeline that uses identity-finetuned diffusion models to generate dense supervision for previously unobserved body regions; (ii) a two-stage canonical-to-pose-dependent architecture that bootstraps from sparse observations to full pose-dependent Gaussian maps; (iii) a map-pose/LBS-pose decoupling that absorbs multi-view inconsistencies in the generated data; (iv) a head/body split supervision strategy that preserves facial identity. We evaluate on YouTube videos and on multi-view capture data with significant occlusion and demonstrate state-of-the-art reconstruction quality. We also demonstrate that the resulting avatars are robust enough to be animated with novel poses and composited into 3DGS scenes captured using cell-phone video. Our project page is available at https://miraymen.github.io/ahoy/.
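To make contribution (iv) concrete, the head/body split supervision idea can be sketched as a reconstruction loss that weights facial pixels separately from body pixels, so that identity-critical detail in the small head region is not washed out by the much larger body region. This is a minimal illustrative sketch, not the paper's implementation; the function name, mask convention, and weights (`w_head`, `w_body`) are all assumptions made for illustration.

```python
import numpy as np

def split_supervision_loss(rendered, target, head_mask,
                           w_head=2.0, w_body=1.0):
    """Hypothetical head/body split L1 loss.

    rendered, target: (H, W, 3) float arrays in [0, 1].
    head_mask: (H, W) binary mask, 1 = head pixel.
    Head and body regions are averaged separately, then combined
    with independent weights so each region contributes on its own
    scale regardless of how many pixels it covers.
    """
    err = np.abs(rendered - target).mean(axis=-1)  # per-pixel L1 over RGB
    head = head_mask.astype(bool)
    head_loss = err[head].mean() if head.any() else 0.0
    body_loss = err[~head].mean() if (~head).any() else 0.0
    return w_head * head_loss + w_body * body_loss

# Toy usage: 4x4 images with a uniform error of 0.1; top row is "head".
H, W = 4, 4
target = np.zeros((H, W, 3))
rendered = np.full((H, W, 3), 0.1)
mask = np.zeros((H, W))
mask[0] = 1
loss = split_supervision_loss(rendered, target, mask)  # 2.0*0.1 + 1.0*0.1 = 0.3
```

Because each region is mean-reduced before weighting, doubling `w_head` doubles the head term's influence regardless of image resolution, which is the intuition behind supervising the head separately from the body.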