FIKA-Bench: From Fine-grained Recognition to Fine-Grained Knowledge Acquisition

📅 2026-05-13

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing visual recognition benchmarks primarily focus on closed-set tasks and are ill-suited for evaluating a model's ability to actively acquire and verify external evidence when confronted with unknown fine-grained objects. To address this gap, this work introduces a new task, fine-grained knowledge acquisition, and presents FIKA-Bench, a leakage-aware, evidence-grounded benchmark comprising 311 public-source and real-life instances. Rigorous filtering excludes samples that frontier closed-book models have memorized or that leak the answer through the image. The evaluation covers the latest large multimodal models (LMMs) and tool-using agents. The best-performing system achieves only 25.1% accuracy, and agent failures stem mainly from wrong entity retrieval and poor visual judgement, which underscores both the difficulty and the necessity of the proposed task.

📝 Abstract

Fine-grained recognition in everyday life is often not a closed-book classification problem: when encountering unfamiliar objects, humans actively search, compare visual details, and verify evidence before deciding. Existing benchmarks primarily evaluate visual recognition, leaving this active external knowledge acquisition ability underexplored. We study fine-grained knowledge acquisition, where a system must seek, verify, and use external evidence to answer open-ended fine-grained recognition questions. We introduce FIKA-Bench, a leakage-aware and evidence-grounded collection of 311 public-source and real-life instances. To ensure high quality, every example is filtered against frontier closed-book models to remove memorized cases and audited to eliminate image-answer leakage, retaining only samples supported by verified evidence. Our evaluation of the latest Large Multimodal Models (LMMs) and agents reveals that the task remains a formidable challenge: the best system reaches only 25.1% accuracy, with no model exceeding 30%. Crucially, we find that merely equipping models with tools is insufficient to bridge this gap; agent failures are predominantly driven by wrong entity retrieval and poor visual judgement. These results show that reliable knowledge acquisition needs better agent designs that focus on fine-grained recognition.
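
The three filtering criteria named in the abstract (not memorized by closed-book models, no image-answer leakage, supported by verified evidence) can be read as a simple keep/drop rule. The Python below is a minimal sketch of that rule, not the authors' pipeline: the `Candidate` class, its field names, and the exact-match comparison in `is_memorized` are all hypothetical, and the paper may judge closed-book answers and audit leakage quite differently.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Candidate:
    """Hypothetical candidate instance; all field names are illustrative."""
    question: str
    answer: str
    # Answers produced by closed-book models (no tools, no retrieval).
    closed_book_predictions: List[str] = field(default_factory=list)
    # Result of a leakage audit, e.g. the answer is printed in the image.
    image_leaks_answer: bool = False
    # Whether the answer is backed by evidence that was checked.
    has_verified_evidence: bool = False


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return " ".join(text.lower().split())


def is_memorized(c: Candidate) -> bool:
    """Assumption: an instance counts as memorized if any closed-book
    prediction exactly matches the gold answer after normalization."""
    gold = normalize(c.answer)
    return any(normalize(p) == gold for p in c.closed_book_predictions)


def keep(c: Candidate) -> bool:
    """Keep only instances that pass all three criteria."""
    return (
        not is_memorized(c)
        and not c.image_leaks_answer
        and c.has_verified_evidence
    )


def filter_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Return the candidates that survive the keep/drop rule."""
    return [c for c in candidates if keep(c)]
```

Under this reading, a candidate is dropped as soon as any single criterion fails, which is why a benchmark built this way stays small relative to its candidate pool.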

Problem

Research questions and friction points this paper is trying to address.

fine-grained recognition

knowledge acquisition

external evidence

benchmark

multimodal models

Innovation

Methods, ideas, or system contributions that make the work stand out.

fine-grained knowledge acquisition

evidence-grounded benchmark

leakage-aware evaluation