🤖 AI Summary
This work addresses key challenges in single-image 3D human avatar reconstruction, namely geometric detail loss, animation artifacts, and low rendering efficiency. To this end, we propose a novel framework that jointly leverages a video diffusion model to synthesize multi-view images, 3D Gaussian Splatting for geometry refinement, and a pose-aware Gaussian mapping mechanism in UV space. Specifically, a Transformer-based encoder projects unstructured 3D Gaussians onto the parameterized UV space of a canonical human body model, explicitly modeling global spatial relationships to ensure high-fidelity texture reconstruction and cross-pose consistency. Evaluated on ActorsHQ and 4D-Dress, our method achieves significant improvements in perceptual quality and photometric accuracy; it renders in real time and drives animation end-to-end from a single input image, with no post-processing. This work establishes a new paradigm for monocular, animatable 3D human reconstruction.
📝 Abstract
We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-views using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on the ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.
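To make the "anchor unstructured Gaussians to the UV manifold, then animate by body-driven deformation" idea concrete, here is a minimal NumPy sketch. It is an illustration only: the nearest-vertex anchoring stands in for the paper's learned transformer encoder/decoder, the rigid translation stands in for pose-driven deformation, and all array names and sizes are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parametric "body": vertices with 3D positions and UV coordinates.
n_verts = 64
body_xyz = rng.normal(size=(n_verts, 3))
body_uv = rng.uniform(size=(n_verts, 2))    # each vertex's UV coordinate

# Toy unstructured Gaussians, as if lifted from dreamed multi-views.
n_gauss = 256
gauss_xyz = rng.normal(size=(n_gauss, 3))

# Anchor each Gaussian to its nearest body vertex (a crude stand-in for
# the learned projection into the structured UV-space representation).
d2 = ((gauss_xyz[:, None, :] - body_xyz[None, :, :]) ** 2).sum(-1)
nearest = d2.argmin(axis=1)                 # (n_gauss,) anchor vertex indices
gauss_uv = body_uv[nearest]                 # structured UV anchor per Gaussian
offsets = gauss_xyz - body_xyz[nearest]     # residual detail kept per Gaussian

# "Animate": deform the body (here a rigid translation as a stand-in for
# pose-driven deformation); anchored Gaussians follow their vertices.
pose_delta = np.array([0.1, 0.0, 0.3])
posed_body_xyz = body_xyz + pose_delta
posed_gauss_xyz = posed_body_xyz[nearest] + offsets

# Consistency check: the residual offsets (fine detail) are preserved
# exactly under the body deformation.
assert np.allclose(posed_gauss_xyz - gauss_xyz, pose_delta)
```

The point of the sketch is the data flow: once every Gaussian carries a UV anchor plus a residual offset, any body deformation moves the Gaussians coherently, which is why UV anchoring yields pose-consistent animation without post-processing.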