VIKSER: Visual Knowledge-Driven Self-Reinforcing Reasoning Framework

📅 2025-02-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision reasoning methods suffer from three key challenges: ambiguous problem descriptions, insufficient fine-grained visual understanding, and opaque, uninterpretable reasoning processes. To address these, we propose a knowledge-driven self-reinforcing reasoning framework. Our core contributions are: (1) the novel Chain-of-Evidence (CoE) prompting paradigm, which explicitly models evidence chains to enhance reasoning traceability and interpretability; (2) a vision-relation-guided fine-grained knowledge distillation mechanism that transfers structured visual knowledge—encoded by object detection models—into large language models; and (3) a self-reflective error correction module enabling iterative refinement of reasoning outputs. Evaluated on mainstream vision reasoning benchmarks, our approach achieves new state-of-the-art performance, significantly improving accuracy while enhancing interpretability, controllability, and robustness of the reasoning process.

📝 Abstract
Visual reasoning refers to the task of answering questions about visual information. Current visual reasoning methods typically employ pre-trained vision-language model (VLM) strategies or deep neural network approaches. However, existing efforts are constrained by limited reasoning interpretability and are hindered by underspecification in the question text. Additionally, the absence of fine-grained visual knowledge limits the precise understanding of subject behavior in visual reasoning tasks. To address these issues, we propose VIKSER (Visual Knowledge-Driven Self-Reinforcing Reasoning Framework). Specifically, VIKSER, trained using knowledge distilled from large language models, extracts fine-grained visual knowledge with the assistance of visual relationship detection techniques. Subsequently, VIKSER utilizes this fine-grained visual knowledge to paraphrase underspecified questions. Additionally, we design a novel prompting method called Chain-of-Evidence (CoE), which leverages the power of "evidence for reasoning" to endow VIKSER with interpretable reasoning capabilities. Meanwhile, the integration of self-reflection technology empowers VIKSER with the ability to learn and improve from its mistakes. Experiments conducted on widely used datasets demonstrate that VIKSER achieves new state-of-the-art (SOTA) results on the relevant tasks.
Problem

Research questions and friction points this paper is trying to address.

Visual Reasoning
Clarity of Problem Description
Model Interpretability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Visual Reasoning
Evidence Chain Method
Self-Reflection Learning
Chunbai Zhang
1. School of Future Technology, Shanghai University, Shanghai, 200444, China; 2. Institute of Artificial Intelligence, Shanghai University, Shanghai, 200444, China
Chao Wang
1. School of Future Technology, Shanghai University, Shanghai, 200444, China; 2. Institute of Artificial Intelligence, Shanghai University, Shanghai, 200444, China
Yang Zhou
3. School of Mechatronic Engineering and Automation, Shanghai, 200444, China
Yan Peng
Professor, Shanghai University
Robotics