🤖 AI Summary
Fungal image zero-shot classification faces challenges due to the scarcity of real annotated data and the difficulty of semantic alignment across growth stages. Method: This paper proposes a growth-stage-aware synthetic data generation paradigm: fine-grained textual descriptions are generated using LLaMA3.2, then paired fungal images are synthesized via controllable image generation; both modalities are aligned within CLIP’s shared embedding space. Crucially, growth-stage knowledge is explicitly encoded as a prior constraint guiding text–image co-generation. Contribution/Results: To our knowledge, this is the first work to incorporate explicit growth-stage semantics into multimodal synthetic data construction. We systematically evaluate how LLM-generated text quality affects cross-stage knowledge transfer. Experiments demonstrate significant improvements in CLIP’s zero-shot classification accuracy, particularly for early growth stages, establishing a scalable, interpretable framework for few-shot biological image recognition.
📝 Abstract
The effectiveness of zero-shot classification in large vision-language models (VLMs), such as Contrastive Language-Image Pre-training (CLIP), depends on access to extensive, well-aligned text-image datasets. In this work, we introduce two complementary data sources: one generated by large language models (LLMs) to describe the stages of fungal growth, and another comprising a diverse set of synthetic fungi images. These datasets are designed to enhance CLIP's zero-shot classification capabilities for fungi-related tasks. To ensure effective alignment between text and image data, we project both into CLIP's shared representation space, focusing on distinct fungal growth stages. We generate text using LLaMA3.2 to bridge modality gaps and synthetically create fungi images. Furthermore, we investigate knowledge transfer by comparing text outputs from different LLM prompting techniques to refine classification across growth stages.
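The zero-shot protocol the abstract describes — projecting stage-specific text prompts and fungal images into CLIP's shared embedding space and classifying by similarity — can be sketched as follows. This is a minimal illustration using random stand-in vectors: in the actual pipeline the embeddings would come from CLIP's text and image encoders, and the stage prompts (shown here as hypothetical examples) from the LLaMA3.2-generated descriptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stage-specific prompts; in the paper these descriptions
# are generated with LLaMA3.2 rather than written by hand.
stage_prompts = [
    "a photo of fungi at the spore stage",
    "a photo of fungi during hyphal growth",
    "a photo of fungi with a mature fruiting body",
]

dim = 512  # embedding size of CLIP ViT-B/32
# Stand-ins for CLIP text embeddings and one query image embedding.
text_emb = rng.normal(size=(len(stage_prompts), dim))
image_emb = rng.normal(size=dim)

def l2_normalize(x, axis=-1):
    """Project onto the unit sphere so dot products become cosine similarities."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Zero-shot classification: cosine similarity in the shared space,
# converted to class probabilities with a temperature-scaled softmax.
logits = 100.0 * (l2_normalize(text_emb) @ l2_normalize(image_emb))
probs = np.exp(logits - logits.max())
probs /= probs.sum()

predicted_stage = stage_prompts[int(np.argmax(probs))]
print(predicted_stage)
```

The design point is that no fungal images are needed at training time: adding or refining a growth stage only requires a new text prompt, which is what makes the quality of the LLM-generated descriptions central to cross-stage transfer.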