🤖 AI Summary
Existing 2D avatar generation methods struggle to simultaneously achieve fine-grained facial expression modeling and cross-expression identity consistency. To address this, we propose a personalized avatar generation framework based on a multimodal diffusion Transformer. Our key contributions are: (1) joint identity-expression representation learning, which disentangles and co-models identity and expression features; (2) a consistency-aware attention mechanism that enforces identity stability across expressions via shared attention weights and explicit inference-time constraints; and (3) end-to-end fine-grained expression synthesis. Evaluated on standard benchmarks, our method significantly outperforms state-of-the-art approaches in expression accuracy, identity preservation, and cross-expression consistency. The framework enables high-fidelity, photorealistic virtual interactions and content creation with robust identity coherence across diverse expressions.
📝 Abstract
Different forms of customized 2D avatars are widely used in gaming applications, virtual communication, education, and content creation. However, existing approaches often fail to capture fine-grained facial expressions and struggle to preserve identity across different expressions. We propose GEN-AFFECT, a novel framework for personalized avatar generation that generates expressive and identity-consistent avatars with a diverse set of facial expressions. Our framework proposes conditioning a multimodal diffusion transformer on an extracted identity-expression representation. This enables identity preservation and representation of a wide range of facial expressions. GEN-AFFECT additionally employs consistent attention at inference for information sharing across the set of generated expressions, enabling the generation process to maintain identity consistency over the array of generated fine-grained expressions. GEN-AFFECT demonstrates superior performance compared to previous state-of-the-art methods on the basis of the accuracy of the generated expressions, the preservation of the identity and the consistency of the target identity across an array of fine-grained facial expressions.