🤖 AI Summary
This work addresses key challenges in single-image 3D human avatar reconstruction, namely geometric detail loss, animation artifacts, and low rendering efficiency. To this end, we propose a novel framework that jointly leverages a video diffusion model to synthesize multi-view images, 3D Gaussian Splatting for geometry refinement, and a pose-aware Gaussian mapping mechanism in UV space. Specifically, a Transformer-based encoder projects unstructured 3D Gaussians onto the parameterized UV space of a canonical human body model, explicitly modeling global spatial relationships to ensure high-fidelity texture reconstruction and cross-pose consistency. Evaluated on ActorsHQ and 4D-Dress, our method achieves significant improvements in perceptual quality and photometric accuracy; it renders in real time and drives animation end-to-end from a single input image, with no post-processing. This work establishes a new paradigm for monocular, animatable 3D human reconstruction.
📝 Abstract
We introduce Dream, Lift, Animate (DLA), a novel framework that reconstructs animatable 3D human avatars from a single image. This is achieved by leveraging multi-view generation, 3D Gaussian lifting, and pose-aware UV-space mapping of 3D Gaussians. Given an image, we first dream plausible multi-views using a video diffusion model, capturing rich geometric and appearance details. These views are then lifted into unstructured 3D Gaussians. To enable animation, we propose a transformer-based encoder that models global spatial relationships and projects these Gaussians into a structured latent representation aligned with the UV space of a parametric body model. This latent code is decoded into UV-space Gaussians that can be animated via body-driven deformation and rendered conditioned on pose and viewpoint. By anchoring Gaussians to the UV manifold, our method ensures consistency during animation while preserving fine visual details. DLA enables real-time rendering and intuitive editing without requiring post-processing. Our method outperforms state-of-the-art approaches on the ActorsHQ and 4D-Dress datasets in both perceptual quality and photometric accuracy. By combining the generative strengths of video diffusion models with a pose-aware UV-space Gaussian mapping, DLA bridges the gap between unstructured 3D representations and high-fidelity, animation-ready avatars.
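To make the "anchor unstructured Gaussians to the UV manifold, then animate by body-driven deformation" idea concrete, here is a minimal NumPy sketch. It is an illustration only: the nearest-vertex anchoring stands in for the paper's learned transformer encoder/decoder, the rigid translation stands in for pose-driven deformation, and all array names and sizes are hypothetical, not from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy parametric "body": vertices with 3D positions and UV coordinates.
n_verts = 64
body_xyz = rng.normal(size=(n_verts, 3))
body_uv = rng.uniform(size=(n_verts, 2))    # each vertex's UV coordinate

# Toy unstructured Gaussians, as if lifted from dreamed multi-views.
n_gauss = 256
gauss_xyz = rng.normal(size=(n_gauss, 3))

# Anchor each Gaussian to its nearest body vertex (a crude stand-in for
# the learned projection into the structured UV-space representation).
d2 = ((gauss_xyz[:, None, :] - body_xyz[None, :, :]) ** 2).sum(-1)
nearest = d2.argmin(axis=1)                 # (n_gauss,) anchor vertex indices
gauss_uv = body_uv[nearest]                 # structured UV anchor per Gaussian
offsets = gauss_xyz - body_xyz[nearest]     # residual detail kept per Gaussian

# "Animate": deform the body (here a rigid translation as a stand-in for
# pose-driven deformation); anchored Gaussians follow their vertices.
pose_delta = np.array([0.1, 0.0, 0.3])
posed_body_xyz = body_xyz + pose_delta
posed_gauss_xyz = posed_body_xyz[nearest] + offsets

# Consistency check: the residual offsets (fine detail) are preserved
# exactly under the body deformation.
assert np.allclose(posed_gauss_xyz - gauss_xyz, pose_delta)
```

The point of the sketch is the data flow: once every Gaussian carries a UV anchor plus a residual offset, any body deformation moves the Gaussians coherently, which is why UV anchoring yields pose-consistent animation without post-processing.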