🤖 AI Summary
Fine-grained bird identification in the field is often hindered by occlusion, low resolution, or reliance on non-visual cues such as vocalizations, making species identification from a single image unreliable; yet existing models lack principled mechanisms to abstain from answering. To address this, the authors introduce RealBirdID, a benchmark built around an evidence-based, reason-aware refusal mechanism: a system must either provide a species prediction or explicitly refuse with a specific justification (e.g., “requires vocalization” or “image too blurry”). The benchmark pairs answerable and unanswerable samples and includes a human-validated test set. Experiments show that leading multimodal large language models (e.g., GPT-5, Gemini-2.5 Pro) achieve less than 13% accuracy on answerable samples, that high classification performance does not imply effective refusal, and that even when models do refuse, they rarely supply the correct reason. This work establishes a new paradigm for evaluating uncertainty calibration and explanatory fidelity.
📝 Abstract
Fine-grained bird species identification in the wild is frequently unanswerable from a single image: key cues may be non-visual (e.g., vocalization) or obscured by occlusion, camera angle, or low resolution. Yet today's multimodal systems are typically judged on answerable, in-schema cases, encouraging confident guesses rather than principled abstention. We propose the RealBirdID benchmark: given an image of a bird, a system should either answer with a species or abstain with a concrete, evidence-based rationale: "requires vocalization," "low quality image," or "view obstructed." For each genus, the dataset includes a validation split of curated unanswerable examples with labeled rationales, paired with a companion set of clearly answerable instances. We find that (1) species identification on the answerable set is challenging for a range of open-source and proprietary models (less than 13% accuracy for MLLMs including GPT-5 and Gemini-2.5 Pro), (2) models with greater classification ability are not necessarily better calibrated to abstain on unanswerable examples, and (3) MLLMs generally fail to provide correct reasons even when they do abstain. RealBirdID establishes a focused target for abstention-aware fine-grained recognition and a recipe for measuring progress.
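The abstract's three findings correspond to three separate metrics: accuracy on answerable items, abstention behavior on unanswerable items, and rationale correctness among abstentions. A minimal sketch of how such answer-or-abstain outputs might be scored (the field names and example schema here are hypothetical, not the benchmark's actual format):

```python
# Hypothetical scoring sketch for an answer-or-abstain benchmark.
# Field names ('answerable', 'gold', 'pred') are illustrative assumptions,
# not RealBirdID's real schema; the rationale labels come from the abstract.

RATIONALES = {"requires vocalization", "low quality image", "view obstructed"}

def score(examples):
    """Each example: {'answerable': bool, 'gold': str, 'pred': str},
    where 'gold' is a species for answerable items and a rationale label
    for unanswerable ones. Returns the three metrics described above."""
    ans_correct = ans_total = 0
    abstained = rationale_correct = unans_total = 0
    for ex in examples:
        pred = ex["pred"]
        if ex["answerable"]:
            ans_total += 1
            ans_correct += (pred == ex["gold"])
        else:
            unans_total += 1
            if pred in RATIONALES:  # model abstained rather than guessing
                abstained += 1
                rationale_correct += (pred == ex["gold"])
    return {
        "answerable_acc": ans_correct / max(ans_total, 1),
        "abstain_rate": abstained / max(unans_total, 1),
        "rationale_acc": rationale_correct / max(abstained, 1),
    }
```

Keeping `rationale_acc` separate from `abstain_rate` matches the paper's observation that models can abstain at a reasonable rate yet still give the wrong reason, so a single combined score would hide that failure mode.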