CAG-Avatar: Cross-Attention Guided Gaussian Avatars for High-Fidelity Head Reconstruction

📅 2025-11-08
🏛️ IEEE/ASME International Conference on Mechatronic and Embedded Systems and Applications
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing animatable avatar methods based on 3D Gaussian splatting drive every Gaussian with a single, global expression code. A uniform code cannot capture the heterogeneous dynamics of distinct facial regions (for example, deformable skin versus rigid teeth), so reconstructions in those regions come out blurred or distorted. To address this limitation, this work proposes a conditionally adaptive Gaussian avatar framework that uses cross-attention: each Gaussian point adaptively extracts the locally relevant driving signal from the global expression code based on its spatial location. This gives fine-grained regional control and removes the constraints of conventional global driving schemes. The method substantially improves reconstruction fidelity, particularly in complex regions such as the teeth, while maintaining real-time rendering performance.


📝 Abstract
Creating high-fidelity, real-time drivable 3D head avatars is a core challenge in digital animation. While 3D Gaussian Splatting (3D-GS) offers unprecedented rendering speed and quality, current animation techniques often rely on a "one-size-fits-all" global conditioning approach, in which all Gaussian primitives are uniformly driven by a single expression code. This simplistic approach fails to disentangle the distinct dynamics of different facial regions, such as deformable skin versus rigid teeth, leading to significant blurring and distortion artifacts. We introduce Conditionally-Adaptive Gaussian Avatars (CAG-Avatar), a framework that resolves this key limitation. At its core is a Conditionally Adaptive Fusion Module built on cross-attention. This mechanism lets each 3D Gaussian act as a query that adaptively extracts the relevant driving signals from the global expression code based on its canonical position. This "tailor-made" conditioning strategy substantially improves the modeling of fine-grained, localized dynamics. Our experiments confirm a significant improvement in reconstruction fidelity, particularly for challenging regions such as the teeth, while preserving real-time rendering performance.
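The per-Gaussian cross-attention conditioning described in the abstract can be illustrated with a small sketch. This is a hypothetical illustration, not the authors' implementation. The following are all assumptions made for clarity: the class name `CrossAttentionConditioner`, splitting the expression code into equal-size tokens, using random linear projections in place of learned weights, using a single attention head, and feeding raw xyz coordinates as queries.

```python
import math
import random


def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def rand_matrix(rows, cols, rng):
    """Random projection standing in for learned weights (hypothetical)."""
    scale = 1.0 / math.sqrt(cols)
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class CrossAttentionConditioner:
    """Hypothetical per-Gaussian conditioning via single-head cross-attention.

    Queries come from each Gaussian's canonical xyz position; keys and values
    come from the global expression code, split into `num_tokens` equal chunks.
    """

    def __init__(self, expr_dim, num_tokens, d_model, seed=0):
        if expr_dim % num_tokens != 0:
            raise ValueError("expr_dim must be divisible by num_tokens")
        rng = random.Random(seed)
        self.num_tokens = num_tokens
        self.token_dim = expr_dim // num_tokens
        self.d_model = d_model
        self.W_q = rand_matrix(d_model, 3, rng)               # position -> query
        self.W_k = rand_matrix(d_model, self.token_dim, rng)  # token -> key
        self.W_v = rand_matrix(d_model, self.token_dim, rng)  # token -> value

    def __call__(self, positions, expr_code):
        d = self.token_dim
        tokens = [expr_code[i * d:(i + 1) * d] for i in range(self.num_tokens)]
        keys = [matvec(self.W_k, t) for t in tokens]
        values = [matvec(self.W_v, t) for t in tokens]
        scale = 1.0 / math.sqrt(self.d_model)
        features, weights = [], []
        for p in positions:
            q = matvec(self.W_q, p)
            scores = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
            attn = softmax(scores)  # this Gaussian's weighting over expression tokens
            feat = [sum(a * v[j] for a, v in zip(attn, values))
                    for j in range(self.d_model)]
            features.append(feat)
            weights.append(attn)
        return features, weights


# Usage: two Gaussians at different canonical positions share one expression code.
cond = CrossAttentionConditioner(expr_dim=12, num_tokens=4, d_model=8)
positions = [[0.0, -0.3, 0.1], [0.2, 0.4, 0.0]]
expr_code = [math.sin(i) for i in range(12)]
features, weights = cond(positions, expr_code)
```

Under global conditioning, every Gaussian would receive the same expression vector. In this sketch, the attention weights, and therefore the extracted features, depend on each Gaussian's position. That position dependence is the property the abstract relies on. A practical version would presumably learn the projections, use positional encoding and multiple heads, and pass the resulting features to a deformation network. These details are not specified here.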
Problem

3D Gaussian Splatting
head avatar
facial dynamics
expression-driven animation
high-fidelity reconstruction
Innovation

Cross-Attention
3D Gaussian Splatting
Conditional Adaptation
High-Fidelity Avatar
Local Dynamics Modeling
Zhe Chang
Dept. of Control Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China
Haodong Jin
Dept. of Control Science and Engineering, University of Shanghai for Science and Technology, Shanghai, China
Yan Song
University of Shanghai for Science and Technology
Model predictive control, Machine learning and data analysis, Image processing and intelligent systems
Hui Yu
Professor of Visual and Cognitive Computing, University of Glasgow
Visual Computing, Cognitive Computing, Social Robot, Parallel Intelligence