CAG-Avatar: Cross-Attention Guided Gaussian Avatars for High-Fidelity Head Reconstruction

📅 2025-11-08

🏛️ IEEE/ASME International Conference on Mechatronic and Embedded Systems and Applications

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing 3D Gaussian splatting-based animatable avatar methods employ a global, uniform expression code, which struggles to capture the heterogeneous dynamic behaviors of distinct facial regions—such as skin versus teeth—leading to blurred or distorted reconstructions. To address this limitation, this work proposes a conditionally adaptive Gaussian avatar framework that leverages cross-attention mechanisms to enable each Gaussian point to adaptively extract locally relevant driving signals from the global expression code based on its spatial location. This approach enables fine-grained regional control and overcomes the constraints of conventional global driving schemes. The method significantly enhances reconstruction fidelity—particularly in complex regions like teeth—while maintaining real-time rendering performance.

📝 Abstract

Creating high-fidelity, real-time drivable 3D head avatars is a core challenge in digital animation. While 3D Gaussian Splatting (3D-GS) offers unprecedented rendering speed and quality, current animation techniques often rely on a "one-size-fits-all" global tuning approach, where all Gaussian primitives are uniformly driven by a single expression code. This simplistic approach fails to disentangle the distinct dynamics of different facial regions, such as deformable skin versus rigid teeth, leading to significant blurring and distortion artifacts. We introduce Conditionally-Adaptive Gaussian Avatars (CAG-Avatar), a framework that resolves this key limitation. At its core is a Conditionally Adaptive Fusion Module built on cross-attention. This mechanism lets each 3D Gaussian act as a query, adaptively extracting relevant driving signals from the global expression code based on its canonical position. This "tailor-made" conditioning strategy substantially improves the modeling of fine-grained, localized dynamics. Our experiments confirm a significant improvement in reconstruction fidelity, particularly for challenging regions such as teeth, while preserving real-time rendering performance.
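
The query-based conditioning described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the weight matrices `W_q`, `W_k`, `W_v`, the dimensions, and in particular the step that splits the global expression code into several tokens are all assumptions made for illustration; the abstract does not say how keys and values are derived from the expression code. The sketch only shows the general idea: each Gaussian's canonical position forms a query, and attention over the expression-derived tokens yields a per-Gaussian driving feature.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def init_params(rng, token_dim, d_model):
    """Random projection weights (hypothetical; a real model would learn these)."""
    return {
        "W_q": rng.normal(scale=0.5, size=(3, d_model)),          # position -> query
        "W_k": rng.normal(scale=0.5, size=(token_dim, d_model)),  # token -> key
        "W_v": rng.normal(scale=0.5, size=(token_dim, d_model)),  # token -> value
    }


def cross_attention_condition(positions, expr_code, params, num_tokens):
    """Per-Gaussian conditioning via single-head cross-attention.

    positions: (N, 3) canonical Gaussian centers, used to build the queries.
    expr_code: (D,) global expression code, with D divisible by num_tokens.
    Returns (N, d_model) per-Gaussian features and (N, num_tokens) attention weights.
    """
    # Assumption: the global code is reshaped into tokens so that attention has
    # more than one element to choose from.
    tokens = expr_code.reshape(num_tokens, -1)       # (T, D / T)
    q = positions @ params["W_q"]                     # (N, d_model)
    k = tokens @ params["W_k"]                        # (T, d_model)
    v = tokens @ params["W_v"]                        # (T, d_model)
    scores = q @ k.T / np.sqrt(q.shape[-1])           # (N, T) scaled dot products
    weights = softmax(scores, axis=-1)                # each row sums to 1
    return weights @ v, weights


rng = np.random.default_rng(0)
num_gaussians, expr_dim, num_tokens, d_model = 5, 64, 8, 16
params = init_params(rng, expr_dim // num_tokens, d_model)
positions = rng.uniform(-1.0, 1.0, size=(num_gaussians, 3))
expr_code = rng.normal(size=expr_dim)

features, weights = cross_attention_condition(positions, expr_code, params, num_tokens)
print(features.shape, weights.shape)
```

In a global driving scheme, every Gaussian would receive the same `expr_code`. Here, Gaussians at different canonical positions produce different attention weights and therefore different driving features from the same code. In a full pipeline such a feature would presumably feed a small network that predicts per-Gaussian offsets, but that part is outside what the abstract specifies.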

Problem

Research questions and friction points this paper is trying to address.

3D Gaussian Splatting

head avatar

facial dynamics

expression-driven animation

high-fidelity reconstruction

Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Attention

3D Gaussian Splatting

Conditional Adaptation