LiftAvatar: Kinematic-Space Completion for Expression-Controlled 3D Gaussian Avatar Animation

📅 2026-03-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian splatting avatars driven by sparse motion cues from monocular videos. To overcome these challenges, we propose a high-fidelity avatar animation method based on kinematic-space completion. By completing sparse observations in both the facial-expression and head-pose spaces, and integrating a multi-granularity expression control mechanism with a multi-reference frame conditioning strategy, our approach enables fine-grained and controllable expression synthesis. We employ a video diffusion Transformer architecture that jointly leverages shading maps and expression coefficients, further enhanced by multi-reference image conditioning to improve temporal coherence and 3D consistency. As a plug-and-play module, our method significantly improves animation quality, quantitative metrics, and expression diversity, particularly for extreme or unseen expressions, while effectively mitigating reconstruction artifacts in existing 3D avatar systems.

📝 Abstract
We present LiftAvatar, a new paradigm that completes sparse monocular observations in kinematic space (e.g., facial expressions and head pose) and uses the completed signals to drive high-fidelity avatar animation. LiftAvatar is a fine-grained, expression-controllable large-scale video diffusion Transformer that synthesizes high-quality, temporally coherent expression sequences conditioned on single or multiple reference images. The key idea is to lift incomplete input data into a richer kinematic representation, thereby strengthening both reconstruction and animation in downstream 3D avatar pipelines. To this end, we introduce (i) a multi-granularity expression control scheme that combines shading maps with expression coefficients for precise and stable driving, and (ii) a multi-reference conditioning mechanism that aggregates complementary cues from multiple frames, enabling strong 3D consistency and controllability. As a plug-and-play enhancer, LiftAvatar directly addresses the limited expressiveness and reconstruction artifacts of 3D Gaussian Splatting-based avatars caused by sparse kinematic cues in everyday monocular videos. By expanding incomplete observations into diverse pose-expression variations, LiftAvatar also enables effective prior distillation from large-scale video generative models into 3D pipelines, leading to substantial gains. Extensive experiments show that LiftAvatar consistently boosts animation quality and quantitative metrics of state-of-the-art 3D avatar methods, especially under extreme, unseen expressions.
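The abstract describes two conditioning ingredients: a multi-granularity expression control that fuses dense shading maps with low-dimensional expression coefficients, and a multi-reference mechanism that aggregates cues from several reference frames. The sketch below illustrates only the high-level data flow of such a fusion; every name, dimension, and the use of random projections and mean pooling are illustrative assumptions, not the paper's actual encoders or aggregation scheme.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes -- none of these sizes come from the paper.
H, W = 64, 64      # shading-map resolution
N_COEFF = 52       # expression-coefficient dimension (blendshape-style)
N_REFS = 3         # number of reference frames
D = 128            # conditioning-token dimension

def encode_shading_map(shading):
    """Stand-in for a learned dense encoder: flatten the shading map and
    project it to a single D-dim token with a fixed random matrix."""
    flat = shading.reshape(-1)
    proj = rng.standard_normal((flat.size, D)) / np.sqrt(flat.size)
    return flat @ proj

def encode_coefficients(coeffs):
    """Stand-in linear lift of the coarse expression coefficients to D dims."""
    proj = rng.standard_normal((coeffs.size, D)) / np.sqrt(coeffs.size)
    return coeffs @ proj

def build_condition(shading, coeffs, ref_feats):
    """Fuse dense (shading) and coarse (coefficient) cues, then aggregate
    complementary reference-frame features; mean pooling is a placeholder
    for whatever aggregation the actual model uses."""
    dense_tok = encode_shading_map(shading)
    coarse_tok = encode_coefficients(coeffs)
    ref_tok = ref_feats.mean(axis=0)
    return np.stack([dense_tok, coarse_tok, ref_tok])  # (3, D) condition tokens

shading = rng.standard_normal((H, W))
coeffs = rng.standard_normal(N_COEFF)
ref_feats = rng.standard_normal((N_REFS, D))

cond = build_condition(shading, coeffs, ref_feats)
print(cond.shape)  # (3, 128)
```

In a real pipeline these tokens would be consumed by the diffusion Transformer's cross-attention; the point here is only that dense and coarse signals end up in a shared token space alongside the pooled multi-reference features.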
Problem

Research questions and friction points this paper is trying to address.

3D avatar
expression control
kinematic completion
monocular video
reconstruction artifacts
Innovation

Methods, ideas, or system contributions that make the work stand out.

kinematic-space completion
expression-controlled animation
3D Gaussian Splatting
multi-reference conditioning
video diffusion Transformer
Hualiang Wei
College of Computer Science and Technology, Jilin University, No. 2699 Qianjin Street, Changchun, 130012, Jilin, China
Shunran Jia
Impressed Inc DBA SocialBook, 950 Tower Ln, Foster City, 94404, California, USA
Jialun Liu
Baidu | JLU
long-tailed data learning · metric learning · 3D generation
Wenhui Li
National Institute of Biological Sciences, Beijing