🤖 AI Summary
This work addresses the challenge that existing text-to-image generation models struggle to capture subtle morphological distinctions among species, often resulting in identity distortion in fine-grained biological image synthesis. To overcome this limitation, we propose a lightweight and efficient approach that injects embeddings from vision-based classification models—such as BioCLIP—into a frozen diffusion model, guiding the generation of high-fidelity, species-specific images while preserving textual control over attributes like pose, style, and background. To our knowledge, this is the first study to integrate vision-based classification models with diffusion-based generative frameworks; we also introduce a novel evaluation metric grounded in multimodal large language models. Our method significantly improves both morphological fidelity and species-identification accuracy in generated images, and generalizes well even under few-shot conditions and to previously unseen species.
📝 Abstract
Accurately generating images across the Tree of Life is difficult: there are over 10M distinct species on Earth, many of which differ only by subtle visual traits. Despite the remarkable progress in text-to-image synthesis, existing models often fail to capture the fine-grained visual cues that define species identity, even when their outputs appear photo-realistic. To address this, we propose TaxaAdapter, a simple and lightweight approach that incorporates Vision Taxonomy Models (VTMs) such as BioCLIP to guide fine-grained species generation. Our method injects VTM embeddings into a frozen text-to-image diffusion model, improving species-level fidelity while preserving flexible text control over attributes such as pose, style, and background. Extensive experiments demonstrate that TaxaAdapter consistently improves morphological fidelity and species-identity accuracy over strong baselines, with a simpler architecture and training recipe. To better evaluate these improvements, we also introduce a multimodal large language model (MLLM)-based metric that summarizes trait-level descriptions from generated and real images, providing a more interpretable measure of morphological consistency. Beyond this, we observe that TaxaAdapter generalizes well in challenging regimes, synthesizing few-shot species from only a handful of training images and even species unseen during training. Overall, our results highlight that VTMs are a key ingredient for scalable, fine-grained species generation.
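The core conditioning idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: all names, dimensions, and the IP-Adapter-style "extra conditioning tokens" design are assumptions. The frozen VTM (e.g. BioCLIP) produces a species embedding, a small trainable projection maps it to a few pseudo-tokens, and those tokens are appended to the frozen text-encoder tokens so the diffusion model's cross-attention can attend over both.

```python
# Hypothetical sketch: inject a frozen classifier's species embedding into a
# frozen text-to-image diffusion model by projecting it to extra conditioning
# tokens appended to the text tokens. Only the projection is trainable.
# Dimensions and the token-append design are illustrative assumptions.
import math
import random

random.seed(0)

TEXT_DIM = 8        # dim of the text-conditioning tokens (assumed)
VTM_DIM = 6         # dim of the vision-taxonomy-model embedding (assumed)
N_EXTRA_TOKENS = 2  # pseudo-tokens produced by the adapter (assumed)

# The only trainable parameters: a linear map VTM_DIM -> N_EXTRA_TOKENS * TEXT_DIM.
W = [[random.gauss(0, 1 / math.sqrt(VTM_DIM)) for _ in range(VTM_DIM)]
     for _ in range(N_EXTRA_TOKENS * TEXT_DIM)]

def adapter(vtm_embedding):
    """Project one VTM embedding into N_EXTRA_TOKENS conditioning tokens."""
    flat = [sum(w * x for w, x in zip(row, vtm_embedding)) for row in W]
    return [flat[i * TEXT_DIM:(i + 1) * TEXT_DIM] for i in range(N_EXTRA_TOKENS)]

def condition(text_tokens, vtm_embedding):
    """Concatenate frozen text tokens with adapter tokens; the frozen U-Net's
    cross-attention then attends over the combined sequence."""
    return text_tokens + adapter(vtm_embedding)

text_tokens = [[0.0] * TEXT_DIM for _ in range(4)]  # e.g. 4 frozen text tokens
vtm_emb = [random.gauss(0, 1) for _ in range(VTM_DIM)]
cond = condition(text_tokens, vtm_emb)
print(len(cond), len(cond[0]))  # 6 tokens of TEXT_DIM each
```

Because the diffusion backbone and both encoders stay frozen, only the small projection is trained, which is what keeps the approach lightweight and usable in few-shot regimes.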
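The MLLM-based metric's comparison step could look like the sketch below. This is an illustrative assumption, not the paper's scoring rule: it presumes an MLLM has already extracted trait-level descriptions (e.g. "orange wing patches") from a real and a generated image, and scores morphological consistency as overlap between the two trait sets.

```python
# Hypothetical comparison step for an MLLM-based trait metric: given trait
# phrases extracted (by a multimodal LLM, not shown) from a real and a
# generated image, score consistency as Jaccard overlap of the trait sets.
def trait_consistency(real_traits, gen_traits):
    """Jaccard overlap between two sets of trait phrases (illustrative only)."""
    real = set(t.lower() for t in real_traits)
    gen = set(t.lower() for t in gen_traits)
    if not (real | gen):
        return 1.0  # two empty descriptions are trivially consistent
    return len(real & gen) / len(real | gen)

real = ["orange wing patches", "black eye stripe", "notched tail"]
gen = ["orange wing patches", "notched tail", "yellow bill"]
score = trait_consistency(real, gen)
print(round(score, 3))  # 2 shared traits out of 4 distinct -> 0.5
```

A set-overlap score like this is interpretable (each missing or hallucinated trait is visible by name), which matches the abstract's motivation for a trait-level metric over a single opaque similarity number.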