BioTrove: A Large Curated Image Dataset Enabling AI for Biodiversity

📅 2024-06-25
📈 Citations: 1
Influential: 1
🤖 AI Summary
Problem: Biodiversity AI models generalize poorly to rare species and ecologically critical taxa because large-scale, high-quality, taxonomically comprehensive image data is scarce. Method: We introduce BioTrove, the largest publicly available research-grade biodiversity image dataset to date. It comprises 161.9 million images spanning roughly 366,600 species across Animalia, Plantae, and Fungi, each annotated with scientific names, taxonomic hierarchies, and common names. We also propose new zero-shot evaluation benchmarks covering multiple taxonomic levels and release BioTrove-Train, a curated subset of 40 million captioned images focused on agriculturally and ecologically vital groups (e.g., insects, birds, fungi), alongside a suite of CLIP models trained on it. Contribution/Results: We report zero-shot accuracy across life stages, rare species, confounding (morphologically similar) species, and multiple taxonomic levels. The dataset is intended to support real-world applications such as pest control, crop monitoring, and biodiversity assessment.

📝 Abstract
We introduce BioTrove, the largest publicly accessible dataset designed to advance AI applications in biodiversity. Curated from the iNaturalist platform and vetted to include only research-grade data, BioTrove contains 161.9 million images, offering unprecedented scale and diversity from three primary kingdoms: Animalia ("animals"), Fungi ("fungi"), and Plantae ("plants"), spanning approximately 366.6K species. Each image is annotated with scientific names, taxonomic hierarchies, and common names, providing rich metadata to support accurate AI model development across diverse species and ecosystems. We demonstrate the value of BioTrove by releasing a suite of CLIP models trained using a subset of 40 million captioned images, known as BioTrove-Train. This subset focuses on seven categories within the dataset that are underrepresented in standard image recognition models, selected for their critical role in biodiversity and agriculture: Aves ("birds"), Arachnida ("spiders/ticks/mites"), Insecta ("insects"), Plantae ("plants"), Fungi ("fungi"), Mollusca ("snails"), and Reptilia ("snakes/lizards"). To support rigorous assessment, we introduce several new benchmarks and report model accuracy for zero-shot learning across life stages, rare species, confounding species, and multiple taxonomic levels. We anticipate that BioTrove will spur the development of AI models capable of supporting digital tools for pest control, crop monitoring, biodiversity assessment, and environmental conservation. These advancements are crucial for ensuring food security, preserving ecosystems, and mitigating the impacts of climate change. BioTrove is publicly available, easily accessible, and ready for immediate use.
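To make the idea of reporting accuracy at multiple taxonomic levels concrete, the following is a minimal sketch of one plausible scoring scheme: a prediction counts as correct at a given rank if its label at that rank matches the ground truth, so a wrong species in the right genus still scores at genus level. The rank list, the `TAXONOMY` lookup, and the function name `accuracy_by_rank` are illustrative assumptions, not part of BioTrove or its benchmark protocol, which may differ.

```python
from typing import Dict, List

# Ranks from coarse to fine. This particular selection is an assumption
# made for illustration; the actual benchmark may use other ranks.
RANKS = ["kingdom", "class", "genus", "species"]

# Hypothetical taxonomy lookup: species name -> label at each rank.
TAXONOMY: Dict[str, Dict[str, str]] = {
    "Danaus plexippus": {"kingdom": "Animalia", "class": "Insecta",
                         "genus": "Danaus", "species": "Danaus plexippus"},
    "Danaus gilippus": {"kingdom": "Animalia", "class": "Insecta",
                        "genus": "Danaus", "species": "Danaus gilippus"},
    "Apis mellifera": {"kingdom": "Animalia", "class": "Insecta",
                       "genus": "Apis", "species": "Apis mellifera"},
    "Amanita muscaria": {"kingdom": "Fungi", "class": "Agaricomycetes",
                         "genus": "Amanita", "species": "Amanita muscaria"},
}


def accuracy_by_rank(predicted: List[str], actual: List[str]) -> Dict[str, float]:
    """Return the fraction of predictions that match the truth at each rank."""
    if not actual or len(predicted) != len(actual):
        raise ValueError("predicted and actual must be non-empty and equal in length")
    scores: Dict[str, float] = {}
    for rank in RANKS:
        # Compare the labels at this rank rather than the raw species names.
        hits = sum(
            TAXONOMY[p][rank] == TAXONOMY[a][rank]
            for p, a in zip(predicted, actual)
        )
        scores[rank] = hits / len(actual)
    return scores


# Toy predictions: one congener confusion and one cross-genus error.
predicted = ["Danaus gilippus", "Apis mellifera", "Amanita muscaria", "Apis mellifera"]
actual = ["Danaus plexippus", "Apis mellifera", "Amanita muscaria", "Danaus plexippus"]
scores = accuracy_by_rank(predicted, actual)
print(scores)
```

In this toy run the species-level score is the lowest and the coarser ranks score higher, which is the pattern a hierarchy-aware metric is meant to expose: confusing two species of the same genus is penalized less than confusing unrelated taxa.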
Problem

Research questions and friction points this paper is trying to address.

biodiversity
AI models
dataset diversity
Innovation

Methods, ideas, or system contributions that make the work stand out.

BioTrove
zero-shot learning
biodiversity dataset
Authors

Chih-Hsuan Yang
Iowa State University, Ames, IA 50011, USA

Ben Feuer
New York University, New York, NY 10003, USA

Zaki Jubery
Assistant Scientist, Iowa State University

Zi K. Deng
University of Arizona, Tucson, AZ 85721, USA

Andre Nakkab
New York University, New York, NY 10003, USA

Md Zahid Hasan
Iowa State University, USA
Computer vision, Multi-modal Learning, Vision-language models, Video action recognition

Shivani Chiranjeevi
PhD Student, Iowa State University

Kelly O. Marshall
Ph.D. Candidate, NYU
Deep Learning, 3D Machine Learning, Generative Modeling

Nirmal Baishnab
Iowa State University, Ames, IA 50011, USA

Asheesh K Singh
Iowa State University, Ames, IA 50011, USA

Arti Singh
Department of Agronomy, Iowa State University of Science and Technology
Plant-based protein crop breeding, Phenomics, HTP, Machine Learning, Data Science

Soumik Sarkar
Iowa State University, Ames, IA 50011, USA

Nirav C. Merchant
University of Arizona, Tucson, AZ 85721, USA

Chinmay Hegde
New York University
AI

B. Ganapathysubramanian
Iowa State University, Ames, IA 50011, USA