AI Summary
Personalized animal image generation suffers from severe identity drift. Appearance varies widely across species and anatomy differs greatly, and the drift is attributed primarily to cross-domain feature misalignment. To address this, we propose AnimalBooth: (1) a lightweight AnimalNet backbone integrated with an adaptive attention module to enforce cross-modal identity feature alignment; (2) a discrete cosine transform (DCT)-based frequency-domain feature fusion mechanism that enables progressive generation, from global structure to fine-grained texture; and (3) a diffusion-based generative framework incorporating multimodal feature fusion and latent-space modulation. We train and evaluate the model on AnimalBench, a newly curated high-quality animal image dataset. Experiments demonstrate that AnimalBooth achieves state-of-the-art performance in both identity fidelity and visual quality. AnimalBench also provides a benchmark for future research in personalized animal image generation.
Abstract
Personalized animal image generation is challenging due to rich appearance cues and large morphological variability. Existing approaches often exhibit feature misalignment across domains, which leads to identity drift. We present AnimalBooth, a framework that strengthens identity preservation with an Animal Net and an adaptive attention module, mitigating cross-domain alignment errors. We further introduce a frequency-controlled feature integration module that applies Discrete Cosine Transform filtering in the latent space to guide the diffusion process, enabling a coarse-to-fine progression from global structure to detailed texture. To advance research in this area, we curate AnimalBench, a high-resolution dataset for animal personalization. Extensive experiments show that AnimalBooth consistently outperforms strong baselines on multiple benchmarks and improves both identity fidelity and perceptual quality.
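The abstract does not specify the filter shape or the schedule used by the frequency-controlled module, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a latent of shape (channels, height, width), a square low-pass mask over the DCT coefficients, and a linear schedule that widens the mask as denoising proceeds. The names `dct_matrix`, `dct_lowpass`, `keep_ratio_schedule`, and the `min_ratio` parameter are hypothetical.

```python
import numpy as np


def dct_matrix(n):
    """Orthonormal DCT-II basis as an (n, n) matrix (rows are frequencies)."""
    k = np.arange(n)[:, None]  # frequency index
    i = np.arange(n)[None, :]  # spatial index
    mat = np.sqrt(2.0 / n) * np.cos(np.pi * (2 * i + 1) * k / (2 * n))
    mat[0, :] = np.sqrt(1.0 / n)  # DC row uses a different normalization
    return mat


def dct_lowpass(latent, keep_ratio):
    """Keep only the lowest-frequency DCT coefficients of each channel.

    latent: array of shape (channels, height, width) (assumed layout).
    keep_ratio: fraction of frequencies kept along each spatial axis.
    """
    h, w = latent.shape[-2:]
    d_h, d_w = dct_matrix(h), dct_matrix(w)
    coeffs = d_h @ latent @ d_w.T  # 2D DCT applied per channel
    kh = max(1, int(round(h * keep_ratio)))
    kw = max(1, int(round(w * keep_ratio)))
    mask = np.zeros((h, w))
    mask[:kh, :kw] = 1.0  # top-left block holds the low frequencies
    return d_h.T @ (coeffs * mask) @ d_w  # inverse 2D DCT


def keep_ratio_schedule(step, num_steps, min_ratio=0.125):
    """Hypothetical linear schedule: coarse structure first, full detail last."""
    progress = step / max(1, num_steps - 1)
    return min_ratio + (1.0 - min_ratio) * progress


# Example: filter a random stand-in "latent" at the first and last steps.
rng = np.random.default_rng(0)
latent = rng.standard_normal((4, 8, 8))
coarse = dct_lowpass(latent, keep_ratio_schedule(0, 10))
full = dct_lowpass(latent, keep_ratio_schedule(9, 10))
```

In this sketch the early steps see only the global, low-frequency layout of the identity features, and the last step passes the latent through unchanged. How the filtered features are actually injected into the diffusion model is not described in the abstract and is left out here.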