BIOCAP: Exploiting Synthetic Captions Beyond Labels in Biological Foundation Models

📅 2025-10-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Biological image analysis suffers from scarce natural language descriptions and overreliance on coarse-grained class labels, limiting fine-grained semantic understanding. Method: We propose leveraging generative multimodal large language models (MLLMs) to synthesize instance-level descriptive captions as auxiliary supervision. Specifically, we fuse Wikipedia visual content with taxon-specific templates to automatically generate biologically meaningful, instance-level captions, which are then used to train BIOCAP—a biological multimodal foundation model optimized for image–text alignment. Contribution/Results: This work presents the first systematic validation of fine-grained textual supervision beyond class labels for biological visual representation learning. Experiments demonstrate that BIOCAP significantly outperforms baseline models on species classification and cross-modal retrieval tasks. The synthesized captions effectively enhance alignment between visual and semantic spaces, establishing a novel paradigm for multimodal modeling in data-scarce biological domains.

📝 Abstract
This work investigates descriptive captions as an additional source of supervision for biological multimodal foundation models. Images and captions can be viewed as complementary samples from the latent morphospace of a species, each capturing certain biological traits. Incorporating captions during training encourages alignment with this shared latent structure, emphasizing potentially diagnostic characters while suppressing spurious correlations. The main challenge, however, lies in obtaining faithful, instance-specific captions at scale. This requirement has limited the utilization of natural language supervision in organismal biology compared with many other scientific domains. We address this gap by generating synthetic captions with multimodal large language models (MLLMs), guided by Wikipedia-derived visual information and taxon-tailored format examples. These domain-specific contexts help reduce hallucination and yield accurate, instance-based descriptive captions. Using these captions, we train BIOCAP (i.e., BIOCLIP with Captions), a biological foundation model that captures rich semantics and achieves strong performance in species classification and text-image retrieval. These results demonstrate the value of descriptive captions beyond labels in bridging biological images with multimodal foundation models.
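The image–caption alignment training the abstract describes follows the standard CLIP-style symmetric contrastive setup (BIOCAP is "BIOCLIP with Captions"). A minimal NumPy sketch of that objective — all function and variable names here are illustrative, not from the paper:

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    Row i of img_emb is assumed to describe the same instance as
    row i of txt_emb (e.g., a specimen photo and its synthetic caption).
    """
    # L2-normalize so dot products are cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (B, B) similarity matrix

    def xent_diag(l):
        # Cross-entropy pulling each row's probability mass to the diagonal
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -float(np.mean(np.diag(log_probs)))

    # Average the image-to-text and text-to-image directions
    return 0.5 * (xent_diag(logits) + xent_diag(logits.T))

rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8))
aligned = clip_contrastive_loss(emb, emb)               # correct pairing
shuffled = clip_contrastive_loss(emb, np.roll(emb, 1, axis=0))  # captions shifted by one
print(aligned, shuffled)
```

Correctly paired embeddings yield a much lower loss than the shifted pairing, which is the pressure that pulls images and captions into the shared latent space the abstract refers to.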
Problem

Research questions and friction points this paper is trying to address.

Generating synthetic captions for biological images using multimodal language models
Training biological foundation models with descriptive captions beyond simple labels
Improving species classification and text-image retrieval through caption supervision
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generates synthetic captions using multimodal language models
Trains biological foundation model with descriptive captions
Uses Wikipedia and taxon examples to reduce hallucination
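The caption-generation step in the bullets above amounts to grounding an MLLM prompt in Wikipedia-derived visual traits plus a taxon-tailored format example. A hypothetical sketch of such prompt assembly — the function name and template wording are illustrative, not the paper's actual prompts:

```python
def build_caption_prompt(taxon, wiki_visual_notes, format_example):
    """Assemble an MLLM captioning prompt grounded in external context.

    Grounding the model in known visual characters and a per-taxon
    format example is the mechanism the paper uses to curb hallucination.
    """
    # Bullet the Wikipedia-derived visual characters
    trait_list = "\n".join(f"- {t}" for t in wiki_visual_notes)
    return (
        f"You are describing one photograph of {taxon}.\n"
        "Visual characters from Wikipedia (mention only traits actually "
        "visible in this image):\n"
        f"{trait_list}\n"
        f"Write an instance-level caption in this style:\n{format_example}\n"
        "Do not report traits you cannot see."
    )

prompt = build_caption_prompt(
    taxon="Danaus plexippus (monarch butterfly)",
    wiki_visual_notes=["orange wings with black veins",
                       "white spots along the wing margins"],
    format_example="An adult butterfly with <traits>, perched on <substrate>.",
)
print(prompt)
```

The resulting prompt, plus the image itself, would be sent to an MLLM; the returned caption then serves as the text side of the contrastive training pair.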
Ziheng Zhang
The Ohio State University
Xinyue Ma
The Ohio State University
Arpita Chowdhury
The Ohio State University
Elizabeth G. Campolongo
The Ohio State University
Matthew J. Thompson
The Ohio State University
Net Zhang
The Ohio State University
Samuel Stevens
PhD student, The Ohio State University
Natural language processing
Hilmar Lapp
Director of Informatics, Center for Genomic and Computational Biology, Duke University
Bioinformatics, Evolution, Phylogenetics, Databases, Data Integration
Tanya Berger-Wolf
Professor of Computer Science and Engineering, Ohio State University
Imageomics, computational ecology, AI for nature, AI for biodiversity, AI for conservation
Yu Su
The Ohio State University
Wei-Lun Chao
The Ohio State University
Jianyang Gu
The Ohio State University
Imageomics, Dataset Distillation, Data-centric AI