BioBench: A Blueprint to Move Beyond ImageNet for Scientific ML Benchmarks

📅 2025-11-20
🤖 AI Summary
ImageNet-1K linear-probe accuracy fails to reliably predict model performance on ecological imagery, explaining only 34% of variance and mis-ranking 30% of high-accuracy models. Method: We introduce BioBench, an open benchmark for ecology vision covering four taxonomic kingdoms, six imaging modalities, and nine application-driven tasks across 3.1 million images. It evaluates frozen backbones with lightweight linear probes, reports class-balanced macro-F1, and exposes a single Python API for downloading data and evaluating ViT-family and other models. Contribution/Results: Across 46 modern vision checkpoints, BioBench gives a more reliable signal of cross-task performance and model ranking than ImageNet accuracy; a full ViT-L evaluation of the benchmark takes about six hours on one A6000 GPU. BioBench offers a reusable, science-aware template for AI-for-science benchmarks.

📝 Abstract
ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
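The evaluation recipe the abstract describes (frozen backbone, lightweight linear classifier, class-balanced macro-F1) can be sketched with scikit-learn. This is a minimal illustration, not the biobench API: the feature matrices are random stand-ins for embeddings a real frozen backbone would produce.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Stand-ins for embeddings from a frozen backbone (e.g. ViT-L, dim 1024).
X_train = rng.normal(size=(500, 1024))
y_train = rng.integers(0, 10, size=500)
X_test = rng.normal(size=(200, 1024))
y_test = rng.integers(0, 10, size=200)

# Linear probe: the backbone stays frozen; only this classifier is fit.
probe = LogisticRegression(max_iter=1000)
probe.fit(X_train, y_train)

# Class-balanced macro-F1: every class contributes equally to the score,
# regardless of how many test images it has.
macro_f1 = f1_score(y_test, probe.predict(X_test), average="macro")
print(f"macro-F1: {macro_f1:.3f}")
```

Because only the probe is trained, a new model checkpoint costs one feature-extraction pass plus a cheap classifier fit, which is what keeps a full ViT-L evaluation to a few GPU-hours.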
Problem

Research questions and friction points this paper is trying to address.

ImageNet fails to predict performance on scientific imagery tasks
Existing benchmarks lack ecological diversity and application relevance
Need standardized evaluation for AI models in scientific domains
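The "explains only 34% of variance" claim is an R² between per-model ImageNet top-1 accuracy and per-model ecology-task scores. A minimal sketch of that computation, using made-up numbers rather than the paper's data:

```python
import numpy as np

# Hypothetical per-checkpoint scores (NOT the paper's data): ImageNet
# top-1 accuracy vs. macro-F1 on an ecology task.
imagenet = np.array([0.71, 0.76, 0.78, 0.80, 0.82, 0.84, 0.85, 0.87])
ecology = np.array([0.55, 0.49, 0.62, 0.58, 0.71, 0.60, 0.74, 0.66])

# Squared Pearson correlation: the fraction of variance in ecology
# scores that ImageNet accuracy explains. A low value means ImageNet
# rank is a weak proxy for ecology performance.
r = np.corrcoef(imagenet, ecology)[0, 1]
print(f"R^2 = {r ** 2:.2f}")
```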
Innovation

Methods, ideas, or system contributions that make the work stand out.

BioBench is an open ecology vision benchmark spanning 9 tasks and 3.1M images
It unifies 4 taxonomic kingdoms and 6 acquisition modalities in one suite
A single Python API handles data download, linear probing, and metric reporting