🤖 AI Summary
Standard evaluation of vision models in ecology and biology relies heavily on generic machine learning metrics (e.g., mAP) and neglects how model errors propagate into downstream scientific inference. Method: We propose an application-oriented evaluation paradigm, instantiated through two real-world case studies—chimpanzee population density estimation and pigeon head orientation inference—and introduce domain-specific metrics (e.g., relative density estimation error, orientation angular deviation) as primary benchmarks. Our methodology integrates video-based behaviour classification, 3D pose estimation, camera-trap distance sampling, and cross-modal error propagation analysis. Contribution/Results: We demonstrate that a model with high mAP can nonetheless induce up to 37% error in density estimates; conversely, the top-performing pose model does not yield the most accurate orientation inferences. This work provides a systematic empirical case for task-specific evaluation, advancing the integration of vision models into ecological and biological scientific workflows.
📝 Abstract
Computer vision methods have demonstrated considerable potential to streamline ecological and biological workflows, with a growing number of datasets and models becoming available to the research community. However, these resources focus predominantly on evaluation using machine learning metrics, with relatively little emphasis on how their application impacts downstream analysis. We argue that models should be evaluated using application-specific metrics that directly represent model performance in the context of their final use case. To support this argument, we present two disparate case studies: (1) estimating chimpanzee abundance and density with camera trap distance sampling when using a video-based behaviour classifier, and (2) estimating head rotation in pigeons using a 3D posture estimator. We show that even models with strong machine learning performance (e.g., 87% mAP) can yield data that lead to discrepancies in abundance estimates compared to expert-derived data. Similarly, the highest-performing models for posture estimation do not produce the most accurate inferences of gaze direction in pigeons. Motivated by these findings, we call on researchers to integrate application-specific metrics into ecological and biological datasets, allowing models to be benchmarked in the context of their downstream application and facilitating better integration of models into application workflows.
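The two domain-specific metrics named above (relative density estimation error and orientation angular deviation) are simple to state but easy to get subtly wrong, e.g., by ignoring angle wrap-around at 360°. As a minimal illustrative sketch (the function names and exact definitions below are assumptions, not the paper's implementation):

```python
def relative_density_error(model_density: float, expert_density: float) -> float:
    """Relative error of a model-derived density estimate against an
    expert-derived baseline (e.g., from camera trap distance sampling)."""
    return abs(model_density - expert_density) / expert_density

def angular_deviation_deg(pred_deg: float, true_deg: float) -> float:
    """Smallest absolute angle (in degrees) between a predicted and a true
    head orientation, accounting for wrap-around at 360 degrees."""
    diff = (pred_deg - true_deg) % 360.0
    return min(diff, 360.0 - diff)

# Hypothetical numbers: a model estimate of 1.37 individuals/km^2 vs. an
# expert estimate of 1.0 individuals/km^2 gives a 37% relative error.
print(round(relative_density_error(1.37, 1.0), 2))  # 0.37
# A prediction of 350 deg against a ground truth of 10 deg is only 20 deg off,
# not 340 deg, once wrap-around is handled.
print(angular_deviation_deg(350.0, 10.0))  # 20.0
```

Note that the naive difference `|350 - 10| = 340` would heavily penalise predictions near the 0°/360° boundary, which is exactly the kind of mismatch between a generic metric and the downstream quantity of interest that the paper argues against.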