LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian splatting avatars driven by sparse motion cues from monocular videos. To overcome these challenges, we propose a high-fidelity avatar animation method based on kinematic-space completion. By completing sparse observations in both the facial-expression and head-pose spaces, and integrating a multi-granularity expression control mechanism with a multi-reference frame conditioning strategy, our approach enables fine-grained and controllable expression synthesis. We employ a video diffusion Transformer architecture that jointly leverages shading maps and expression coefficients, further enhanced by multi-reference image conditioning to improve temporal coherence and 3D consistency. As a plug-and-play module, our method significantly improves animation quality, quantitative metrics, and expression diversity, particularly for extreme or unseen expressions, while effectively mitigating reconstruction artifacts in existing 3D avatar systems.

📝 Abstract
We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
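The abstract describes two conditioning ingredients: a multi-granularity expression control that fuses dense shading maps with low-dimensional expression coefficients, and a multi-reference mechanism that aggregates cues from several reference frames. The sketch below illustrates only the high-level data flow of such a fusion; every name, dimension, and the use of random projections and mean pooling are illustrative assumptions, not the paper's actual encoders or aggregation scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes -- none of these sizes come from the paper.
H, W = 64, 64      # shading-map resolution
N_COEFF = 52       # expression-coefficient dimension (blendshape-style)
N_REFS = 3         # number of reference frames
D = 128            # conditioning-token dimension

def encode_shading_map(shading):
    """Stand-in for a learned dense encoder: flatten the shading map and
    project it to a single D-dim token with a fixed random matrix."""
    flat = shading.reshape(-1)
    proj = rng.standard_normal((flat.size, D)) / np.sqrt(flat.size)
    return flat @ proj

def encode_coefficients(coeffs):
    """Stand-in linear lift of the coarse expression coefficients to D dims."""
    proj = rng.standard_normal((coeffs.size, D)) / np.sqrt(coeffs.size)
    return coeffs @ proj

def build_condition(shading, coeffs, ref_feats):
    """Fuse dense (shading) and coarse (coefficient) cues, then aggregate
    complementary reference-frame features; mean pooling is a placeholder
    for whatever aggregation the actual model uses."""
    dense_tok = encode_shading_map(shading)
    coarse_tok = encode_coefficients(coeffs)
    ref_tok = ref_feats.mean(axis=0)
    return np.stack([dense_tok, coarse_tok, ref_tok])  # (3, D) condition tokens

shading = rng.standard_normal((H, W))
coeffs = rng.standard_normal(N_COEFF)
ref_feats = rng.standard_normal((N_REFS, D))

cond = build_condition(shading, coeffs, ref_feats)
print(cond.shape)  # (3, 128)
```

In a real pipeline these tokens would be consumed by the diffusion Transformer's cross-attention; the point here is only that dense and coarse signals end up in a shared token space alongside the pooled multi-reference features.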
Problem

Research questions and friction points this paper is trying to address.

3D avatar
expression control
kinematic completion
monocular video
reconstruction artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

kinematic-space completion
expression-controlled animation
3D Gaussian Splatting
multi-reference conditioning
video diffusion Transformer
Hualiang Wei
College of Computer Science and Technology, Jilin University, No. 2699 Qianjin Street, Changchun, 130012, Jilin, China
Shunran Jia
Impressed Inc DBA SocialBook, 950 Tower Ln, Foster City, 94404, California, USA
Jialun Liu
Baidu | JLU
long-tailed data learning · metric learning · 3D generation
Wenhui Li
National Institute of Biological Sciences, Beijing