TaxaAdapter: Vision Taxonomy Models are Key to Fine-grained Image Generation over the Tree of Life

📅 2026-03-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge that existing text-to-image generation models struggle to capture subtle morphological distinctions among species, often producing identity distortion in fine-grained biological image synthesis. To overcome this limitation, we propose a lightweight and efficient approach that injects embeddings from vision-based classification models, such as BioCLIP, into a frozen diffusion model, guiding the generation of high-fidelity, species-specific images while preserving textual control over attributes like pose, style, and background. To our knowledge, this is the first study to integrate visual classification models with diffusion-based generative frameworks; we additionally introduce a novel evaluation metric grounded in multimodal large language models. Our method significantly improves both morphological fidelity and species-identification accuracy in generated images, and generalizes well even under few-shot conditions and to previously unseen species.
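The summary above describes injecting vision-model embeddings into a frozen diffusion model while keeping text control. The paper does not spell out the mechanism, but a common way to realize this is a small trainable projection that maps the species embedding into extra conditioning tokens consumed by the diffusion model's cross-attention. The sketch below is a hedged illustration under that assumption; the class name, dimensions, and token count are hypothetical, not the paper's actual design.

```python
import torch
import torch.nn as nn

class TaxaAdapterSketch(nn.Module):
    """Illustrative sketch: project a frozen vision-taxonomy embedding
    (e.g. from BioCLIP) into a few extra conditioning tokens appended to
    the text tokens of a frozen diffusion model. All dimensions and names
    are assumptions for illustration, not the paper's implementation."""

    def __init__(self, vtm_dim=512, cond_dim=768, num_tokens=4):
        super().__init__()
        # The only trainable piece: a projection from VTM embedding space
        # to `num_tokens` tokens in the text-conditioning width.
        self.proj = nn.Linear(vtm_dim, num_tokens * cond_dim)
        self.num_tokens = num_tokens
        self.cond_dim = cond_dim

    def forward(self, vtm_embed, text_tokens):
        # vtm_embed:   (B, vtm_dim)      species embedding from the frozen VTM
        # text_tokens: (B, T, cond_dim)  output of the frozen text encoder
        b = vtm_embed.shape[0]
        species_tokens = self.proj(vtm_embed).view(b, self.num_tokens, self.cond_dim)
        # Concatenate so cross-attention attends to both text and species
        # tokens: text keeps control of pose/style/background while the
        # species tokens carry fine-grained identity cues.
        return torch.cat([text_tokens, species_tokens], dim=1)
```

Because only the projection is trained while the diffusion backbone, text encoder, and VTM stay frozen, this kind of adapter is cheap to fit even from a handful of images per species, consistent with the few-shot behavior the summary reports.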
📝 Abstract
Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To address this, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphological fidelity and species-identity accuracy over strong baselines, with a cleaner architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal Large Language Model-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter exhibits strong generalization, enabling synthesis in challenging regimes such as few-shot species with only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.
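The abstract's metric summarizes trait-level descriptions from generated and real images and scores their consistency. The paper's exact scoring rule is not given here; one simple way to make such a comparison concrete is a set-overlap score over trait phrases, assuming an MLLM has already extracted short trait descriptions (e.g. "yellow wing bars") from each image. The function below is that hedged sketch, not the paper's metric.

```python
def trait_consistency(real_traits, gen_traits):
    """Sketch of a trait-level consistency score in [0, 1].

    Assumes a multimodal LLM has already produced short trait phrases
    for a real and a generated image of the same species; this function
    only compares the two phrase sets via Jaccard overlap. The actual
    metric in the paper may weight or match traits differently.
    """
    real = {t.strip().lower() for t in real_traits}
    gen = {t.strip().lower() for t in gen_traits}
    if not real and not gen:
        return 1.0  # vacuously consistent: no traits reported for either
    return len(real & gen) / len(real | gen)
```

A set-based score like this is interpretable: a low value directly names which morphological traits (the set difference) the generator failed to reproduce, which is the kind of diagnostic a single image-similarity number cannot give.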
Problem

Research questions and friction points this paper is trying to address.

fine-grained image generation
species identity
Tree of Life
visual traits
text-to-image synthesis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Vision Taxonomy Models
fine-grained image generation
diffusion models
species identity
multimodal evaluation