GenLCA: 3D Diffusion for Full-Body Avatars from In-the-Wild Videos

📅 2026-04-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenging problem of generating high-fidelity, animatable full-body 3D avatars from monocular in-the-wild videos, particularly under data sparsity and partial occlusions that lead to reconstruction ambiguities and artifacts. To this end, the authors propose the first diffusion-based generative framework trained directly in 3D space using millions of real-world videos. The approach leverages a pretrained 3D avatar reconstruction model as a 3D tokenizer and introduces a visibility-aware token replacement and loss computation mechanism to effectively handle incomplete observations. By incorporating text or image conditioning, the method significantly enhances the visual fidelity, motion controllability, and diversity of the generated avatars, outperforming existing state-of-the-art techniques across multiple dimensions.
📝 Abstract
We present GenLCA, a diffusion-based model for generating and editing photorealistic full-body avatars from text and image inputs. The generated avatars are faithful to the inputs while supporting high-fidelity facial and full-body animations. The core idea is a novel paradigm that enables training a full-body 3D diffusion model from partially observable 2D data, allowing the training dataset to scale to millions of real-world videos. This scalability contributes to the superior photorealism and generalizability of GenLCA. Specifically, we scale up the dataset by repurposing a pretrained feed-forward avatar reconstruction model as an animatable 3D tokenizer, which encodes unstructured video frames into structured 3D tokens. However, most real-world videos provide only partial observations of body parts, resulting in excessive blurring or transparency artifacts in the 3D tokens. To address this, we propose a novel visibility-aware diffusion training strategy that replaces invalid regions with learnable tokens and computes losses only over valid regions. We then train a flow-based diffusion model on the token dataset, inherently preserving the photorealism and animatability provided by the pretrained avatar reconstruction model. Our approach effectively enables the use of large-scale real-world video data to train a diffusion model natively in 3D. We demonstrate the efficacy of our method through diverse and high-fidelity generation and editing results, outperforming existing solutions by a large margin. The project page is available at https://onethousandwu.com/GenLCA-Page.
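The visibility-aware training strategy described in the abstract can be sketched in a few lines: unobserved tokens are swapped for a shared learnable token before the flow-matching interpolation, and the loss is averaged only over the visible regions. The sketch below is a toy NumPy illustration under those stated assumptions; the function names, the stand-in denoiser, and the token shapes are illustrative, not the paper's actual API.

```python
import numpy as np

rng = np.random.default_rng(0)

N, D = 16, 8                          # toy setup: 16 3D tokens, 8 dims each
tokens = rng.standard_normal((N, D))  # tokens from the pretrained 3D tokenizer (toy data)
visible = rng.uniform(size=N) > 0.3   # per-token visibility mask from the video
learnable_token = np.zeros(D)         # shared learnable placeholder (would be trained)

def toy_denoiser(xt, t):
    # stand-in for the flow-based diffusion network (assumption, not the real model)
    return xt * (1.0 - t)

def visibility_aware_loss(tokens, visible, learnable_token, denoiser, rng):
    # 1) replace invalid (unobserved) tokens with the learnable token
    x1 = np.where(visible[:, None], tokens, learnable_token)
    # 2) rectified-flow interpolation between noise x0 and data x1
    x0 = rng.standard_normal(x1.shape)
    t = rng.uniform()
    xt = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0                # constant-velocity flow-matching target
    # 3) compute the loss only over valid (visible) regions
    v_pred = denoiser(xt, t)
    sq_err = ((v_pred - v_target) ** 2).mean(axis=1)
    return sq_err[visible].mean()

loss = visibility_aware_loss(tokens, visible, learnable_token, toy_denoiser, rng)
```

Because invisible tokens are overwritten before the loss is formed, corrupted or missing body parts in a video cannot inject gradient noise: perturbing the unobserved tokens leaves the loss unchanged.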
Problem

Research questions and friction points this paper is trying to address.

full-body avatars
3D diffusion
in-the-wild videos
partial observations
photorealism
Innovation

Methods, ideas, or system contributions that make the work stand out.

3D diffusion
full-body avatars
in-the-wild videos
visibility-aware training
3D tokenization