🤖 AI Summary
ImageNet-1K linear-probe accuracy no longer reliably predicts model performance on ecological imagery: it explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% top-1 accuracy. Method: We introduce BioBench, an open ecology vision benchmark spanning nine application-driven tasks, four taxonomic kingdoms, and six acquisition modalities across 3.1 million images. It evaluates frozen backbones with lightweight linear probes, reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF), and exposes a single Python API that downloads data and runs the full pipeline. Contribution/Results: Evaluated on 46 modern vision model checkpoints, BioBench captures signal that ImageNet accuracy misses; a full ViT-L evaluation takes six hours on one A6000 GPU. BioBench offers a reusable template for building reliable AI-for-science benchmarks in other domains.
📝 Abstract
ImageNet-1K linear-probe transfer accuracy remains the default proxy for visual representation quality, yet it no longer predicts performance on scientific imagery. Across 46 modern vision model checkpoints, ImageNet top-1 accuracy explains only 34% of variance on ecology tasks and mis-ranks 30% of models above 75% accuracy. We present BioBench, an open ecology vision benchmark that captures what ImageNet misses. BioBench unifies 9 publicly released, application-driven tasks, 4 taxonomic kingdoms, and 6 acquisition modalities (drone RGB, web video, micrographs, in-situ and specimen photos, camera-trap frames), totaling 3.1M images. A single Python API downloads data, fits lightweight classifiers to frozen backbones, and reports class-balanced macro-F1 (plus domain metrics for FishNet and FungiCLEF); ViT-L models evaluate in 6 hours on an A6000 GPU. BioBench provides new signal for computer vision in ecology and a template recipe for building reliable AI-for-science benchmarks in any domain. Code and predictions are available at https://github.com/samuelstevens/biobench and results at https://samuelstevens.me/biobench.
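The evaluation recipe the abstract describes (frozen backbone, lightweight linear classifier, class-balanced macro-F1) can be sketched as follows. This is not the BioBench API itself, only a minimal illustration of the protocol; the synthetic features stand in for embeddings a frozen ViT backbone would produce.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score

rng = np.random.default_rng(0)

# Stand-in for frozen-backbone embeddings: in practice these would come
# from a pretrained vision checkpoint; here, synthetic 256-d features
# with three classes and a small per-class mean shift.
n, d = 600, 256
y = rng.integers(0, 3, size=n)
X = rng.normal(size=(n, d)) + 0.5 * y[:, None]

X_train, X_test = X[:400], X[400:]
y_train, y_test = y[:400], y[400:]

# Lightweight linear probe fit on top of the (frozen) features.
clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)

# Class-balanced macro-F1: unweighted mean of per-class F1 scores,
# so rare classes count as much as common ones.
macro_f1 = f1_score(y_test, clf.predict(X_test), average="macro")
print(f"macro-F1: {macro_f1:.3f}")
```

Macro averaging is what makes the metric "class-balanced": unlike accuracy or micro-F1, it is not dominated by the head classes of the long-tailed distributions typical of ecological datasets.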