Seeing as Experts Do: A Knowledge-Augmented Agent for Open-Set Fine-Grained Visual Understanding

📅 2026-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing fine-grained visual understanding methods are constrained by closed-set category systems and single-label predictions, leading to significant performance degradation in open-world or context-dependent scenarios and a lack of interpretable reasoning. To address these limitations, this work proposes KFRA, a knowledge-enhanced fine-grained reasoning agent that emulates expert analysis through a three-stage closed-loop inference process: generating hypotheses via open-vocabulary detection and web retrieval, aligning textual knowledge with visual evidence to localize discriminative regions, and performing interpretable reasoning by fusing multimodal evidence with large models. The core innovation is a retrieval–localization coupling mechanism that transforms external knowledge into spatially grounded evidence, enabling task-agnostic, fact-based, and interpretable open-set fine-grained reasoning. On the newly introduced FGExpertBench benchmark, KFRA achieves up to a 19% improvement in reasoning accuracy, substantially outperforming current large models and agent-based approaches.

📝 Abstract
Fine-grained visual understanding is shifting from static classification to knowledge-augmented reasoning, where models must justify as well as recognise. Existing approaches remain limited by closed-set taxonomies and single-label prediction, leading to significant degradation under open-set or context-dependent conditions. We present the Knowledge-Augmented Fine-Grained Reasoning Agent (KFRA), a unified framework that transforms fine-grained perception into evidence-driven reasoning. KFRA operates through a three-stage closed reasoning loop that emulates expert analysis. It first performs open-vocabulary detection and web-scale retrieval to generate category hypotheses. It then conducts discriminative region localisation by aligning textual knowledge with visual evidence through a global-to-local focusing mechanism. Finally, it integrates all multimodal evidence within a large multimodal model to perform interpretable reasoning. Unlike existing agents that treat retrieval and reasoning as independent processes, KFRA establishes a retrieval-grounding coupling that converts retrieved knowledge into spatially grounded evidence for verification. This design enables factual, interpretable, and task-agnostic reasoning across diverse fine-grained scenarios. To evaluate this capability, we construct FGExpertBench, a benchmark designed to assess reasoning depth and cross-task generalisation across six knowledge dimensions. Extensive experiments demonstrate that KFRA consistently surpasses both standalone large multimodal models and current agent frameworks, achieving up to a 19 percent improvement in reasoning accuracy and delivering evidence-grounded interpretability in open-set fine-grained visual understanding.
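The three-stage closed reasoning loop described in the abstract can be sketched as follows. This is an illustrative sketch only: every function name, data structure, and the mock scoring logic are assumptions for exposition, with the paper's actual components (open-vocabulary detector, web retriever, large multimodal model) replaced by stubs.

```python
# Hypothetical sketch of KFRA's three-stage loop; not the authors' implementation.

def generate_hypotheses(image):
    """Stage 1: open-vocabulary detection plus web retrieval produce
    candidate categories with associated textual knowledge (stubbed)."""
    return [
        {"category": "Indigo Bunting", "knowledge": "uniformly blue plumage"},
        {"category": "Blue Grosbeak", "knowledge": "blue body with chestnut wing bars"},
    ]

def localise_evidence(image, hypotheses):
    """Stage 2: align each hypothesis's textual cues with image regions
    (global-to-local focusing), yielding spatially grounded evidence."""
    evidence = []
    for h in hypotheses:
        # A real system would ground h["knowledge"] to a region via the
        # detector; here we attach a dummy box and a mock alignment score.
        score = 0.9 if "chestnut" in h["knowledge"] else 0.4
        evidence.append({**h, "region": (10, 20, 60, 80), "alignment": score})
    return evidence

def reason(evidence):
    """Stage 3: fuse multimodal evidence in a large multimodal model
    (stubbed by selecting the best-supported hypothesis) and emit a rationale."""
    best = max(evidence, key=lambda e: e["alignment"])
    rationale = f"{best['category']}: '{best['knowledge']}' verified at {best['region']}"
    return best["category"], rationale

def run_kfra(image):
    """Closed loop: hypotheses -> grounded evidence -> interpretable answer."""
    hypotheses = generate_hypotheses(image)
    evidence = localise_evidence(image, hypotheses)
    return reason(evidence)

label, rationale = run_kfra(image=None)
print(label)
print(rationale)
```

The key structural point the sketch tries to capture is the retrieval-grounding coupling: retrieved knowledge is not passed straight to the reasoner, but is first converted into spatially grounded evidence (Stage 2) that the final reasoning step verifies against.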
Problem

Research questions and friction points this paper is trying to address.

fine-grained visual understanding
open-set recognition
knowledge-augmented reasoning
interpretable reasoning
multimodal reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

knowledge-augmented reasoning
open-set fine-grained visual understanding
retrieval-grounding coupling
multimodal evidence integration
interpretable reasoning
Junhan Chen
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
Zilu Zhou
Ph.D., University of Pennsylvania
Statistical Genomics · Statistical Modeling · Deep Learning · Bioinformatics · Computational Biology
Yujun Tong
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
Dongliang Chang
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
Yitao Luo
School of Artificial Intelligence, Beijing University of Posts and Telecommunications, China
Zhanyu Ma
Beijing University of Posts and Telecommunications
Pattern Recognition · Machine Learning · Computer Vision · Multimedia Technology · Deep Learning