🤖 AI Summary
This work addresses the challenge that existing text-to-image generation models struggle to capture subtle morphological distinctions among species, often resulting in identity distortion in fine-grained biological image synthesis. To overcome this limitation, we propose a lightweight and efficient approach that injects embeddings from vision-based classification models—such as BioCLIP—into a frozen diffusion model, guiding the generation of high-fidelity, species-specific images while preserving textual control over attributes like pose, style, and background. To our knowledge, this is the first study to integrate vision-based classification models with diffusion-based generative frameworks; we also introduce a novel evaluation metric grounded in multimodal large language models. Our method significantly improves both morphological fidelity and species-identification accuracy in generated images, and generalizes well even under few-shot conditions and to previously unseen species.
📝 Abstract
Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To address this, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphological fidelity and species-identity accuracy over strong baselines, with a simpler architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal large language model (MLLM)-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter generalizes well in challenging regimes, synthesizing few-shot species from only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.
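The core conditioning idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names, dimensions, and the IP-Adapter-style "extra conditioning tokens" design are assumptions. The frozen VTM (e.g. BioCLIP) produces a species embedding, a small trainable projection maps it to a few pseudo-tokens, and those tokens are appended to the frozen text-encoder tokens so the diffusion model's cross-attention can attend over both.

```python
# Hypothetical sketch: inject a frozen classifier's species embedding into a
# frozen text-to-image diffusion model by projecting it to extra conditioning
# tokens appended to the text tokens. Only the projection is trainable.
# Dimensions and the token-append design are illustrative assumptions.
import math
import random

random.seed(0)

TEXT_DIM = 8        # dim of the text-conditioning tokens (assumed)
VTM_DIM = 6         # dim of the vision-taxonomy-model embedding (assumed)
N_EXTRA_TOKENS = 2  # pseudo-tokens produced by the adapter (assumed)

# The only trainable parameters: a linear map VTM_DIM -> N_EXTRA_TOKENS * TEXT_DIM.
W = [[random.gauss(0, 1 / math.sqrt(VTM_DIM)) for _ in range(VTM_DIM)]
     for _ in range(N_EXTRA_TOKENS * TEXT_DIM)]

def adapter(vtm_embedding):
    """Project one VTM embedding into N_EXTRA_TOKENS conditioning tokens."""
    flat = [sum(w * x for w, x in zip(row, vtm_embedding)) for row in W]
    return [flat[i * TEXT_DIM:(i + 1) * TEXT_DIM] for i in range(N_EXTRA_TOKENS)]

def condition(text_tokens, vtm_embedding):
    """Concatenate frozen text tokens with adapter tokens; the frozen U-Net's
    cross-attention then attends over the combined sequence."""
    return text_tokens + adapter(vtm_embedding)

text_tokens = [[0.0] * TEXT_DIM for _ in range(4)]  # e.g. 4 frozen text tokens
vtm_emb = [random.gauss(0, 1) for _ in range(VTM_DIM)]
cond = condition(text_tokens, vtm_emb)
print(len(cond), len(cond[0]))  # 6 tokens of TEXT_DIM each
```

Because the diffusion backbone and both encoders stay frozen, only the small projection is trained, which is what keeps the approach lightweight and usable in few-shot regimes.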
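The MLLM-based metric's comparison step could look like the sketch below. This is an illustrative assumption, not the paper's scoring rule: it presumes an MLLM has already extracted trait-level descriptions (e.g. "orange wing patches") from a real and a generated image, and scores morphological consistency as overlap between the two trait sets.

```python
# Hypothetical comparison step for an MLLM-based trait metric: given trait
# phrases extracted (by a multimodal LLM, not shown) from a real and a
# generated image, score consistency as Jaccard overlap of the trait sets.
def trait_consistency(real_traits, gen_traits):
    """Jaccard overlap between two sets of trait phrases (illustrative only)."""
    real = set(t.lower() for t in real_traits)
    gen = set(t.lower() for t in gen_traits)
    if not (real | gen):
        return 1.0  # two empty descriptions are trivially consistent
    return len(real & gen) / len(real | gen)

real = ["orange wing patches", "black eye stripe", "notched tail"]
gen = ["orange wing patches", "notched tail", "yellow bill"]
score = trait_consistency(real, gen)
print(round(score, 3))  # 2 shared traits out of 4 distinct -> 0.5
```

A set-overlap score like this is interpretable (each missing or hallucinated trait is visible by name), which matches the abstract's motivation for a trait-level metric over a single opaque similarity number.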