🤖 AI Summary
Existing cartoon avatar datasets and generation methods lack fine-grained facial expression diversity and often draw on real-world identities, raising privacy concerns. This paper proposes GenEAva, a framework that first fine-tunes SDXL, a state-of-the-art text-to-image diffusion model, to synthesize highly detailed, expressive realistic faces spanning 135 fine-grained facial expressions, and then applies a stylization model that transforms these faces into cartoon avatars while preserving both identity and expression. Key contributions: (1) GenEAva 1.0, the first expressive cartoon avatar dataset, comprising 13,230 avatars that cover 135 fine-grained expressions with balanced gender, racial, and age distributions; (2) a fine-tuned model that generates markedly more expressive faces than SDXL; and (3) verification that the generated avatars do not include memorized identities from the fine-tuning data, balancing expressiveness with privacy. The framework and dataset provide a diverse, expressive benchmark for future research in cartoon avatar generation.
📝 Abstract
Cartoon avatars have been widely used in various applications, including social media, online tutoring, and gaming. However, existing cartoon avatar datasets and generation methods struggle to present highly expressive avatars with fine-grained facial expressions, and they are often inspired by real-world identities, raising privacy concerns. To address these challenges, we propose a novel framework, GenEAva, for generating high-quality cartoon avatars with fine-grained facial expressions. Our approach fine-tunes a state-of-the-art text-to-image diffusion model to synthesize highly detailed and expressive facial expressions. We then incorporate a stylization model that transforms these realistic faces into cartoon avatars while preserving both identity and expression. Leveraging this framework, we introduce the first expressive cartoon avatar dataset, GenEAva 1.0, specifically designed to capture 135 fine-grained facial expressions, featuring 13,230 expressive cartoon avatars with a balanced distribution across genders, racial groups, and age ranges. We demonstrate that our fine-tuned model generates more expressive faces than the state-of-the-art text-to-image diffusion model SDXL. We also verify that the cartoon avatars generated by our framework do not include memorized identities from the fine-tuning data. The proposed framework and dataset provide a diverse and expressive benchmark for future research in cartoon avatar generation.