StaR-KVQA: Structured Reasoning Traces for Implicit-Knowledge Visual Question Answering

📅 2025-10-08
📈 Citations: 0
✨ Influential: 0
📄 PDF
🤖 AI Summary
In implicit-knowledge visual question answering (IK-KVQA), multimodal large language models (MLLMs) produce inconsistent explanations and generalize poorly because they lack explicit reasoning supervision. Method: a dual-path structured supervision framework that jointly leverages symbolic relation paths and path-grounded natural-language explanations to construct traceable, interpretable reasoning trajectories. The approach combines offline trajectory construction, quality-aware filtering, and structured self-distillation fine-tuning, enabling transparent end-to-end reasoning without external knowledge bases or retrievers. Crucially, inference requires only a single autoregressive generation pass. Contribution/Results: the method improves interpretability and robustness, achieving up to +11.3% absolute accuracy over the strongest baseline on benchmarks including OK-VQA, while also improving cross-domain generalization and answer credibility.

๐Ÿ“ Abstract
Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. We study its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source, without external retrieval. Yet MLLMs lack explicit reasoning supervision, produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning (SFT). We present StaR-KVQA (Structured Reasoning Traces for IK-KVQA), which supervises structured traces (dual symbolic relation paths plus path-grounded natural-language explanations) so that reasoning becomes transparent and verifiable. With one open-source MLLM, StaR-KVQA constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision; no external retrievers, verifiers, or curated knowledge bases (KBs) are used, traces are built offline, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA improves both accuracy and interpretability, achieving up to +11.3% higher answer accuracy on OK-VQA over the strongest baseline while exhibiting robust cross-domain generalization.
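The offline trace construction and selection described in the abstract can be sketched as follows. This is an illustrative sketch, not the authors' code: the `mllm` callable, the `Trace` fields, and the groundedness check are assumptions standing in for the paper's actual sampling and quality-aware filtering procedures.

```python
# Hedged sketch of offline trace construction with quality-aware filtering.
# `mllm` is a hypothetical callable returning a candidate Trace per sample.
from dataclasses import dataclass

@dataclass
class Trace:
    relation_path: list  # symbolic relation path, e.g. ["racket", "used_in", "tennis"]
    explanation: str     # natural-language explanation grounded in the path
    answer: str          # final answer string

def construct_traces(mllm, image, question, n_samples=8):
    """Sample several candidate structured traces from one open-source MLLM."""
    return [mllm(image, question) for _ in range(n_samples)]

def quality_filter(traces, gold_answer):
    """Keep traces whose answer matches the gold label and whose explanation
    mentions every hop of its relation path (a proxy for path-groundedness)."""
    kept = []
    for t in traces:
        correct = t.answer.strip().lower() == gold_answer.strip().lower()
        grounded = all(hop.lower() in t.explanation.lower()
                       for hop in t.relation_path)
        if correct and grounded:
            kept.append(t)
    return kept
```

Filtered traces would then form the trace-enriched dataset used for fine-tuning; the real filter presumably scores traces more carefully than this substring check.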
Problem

Research questions and friction points this paper is trying to address.

Improving implicit-knowledge visual question answering without external knowledge retrieval
Addressing inconsistent reasoning and poor generalization in multimodal language models
Enhancing answer accuracy and interpretability through structured reasoning traces
Innovation

Methods, ideas, or system contributions that make the work stand out.

Supervises structured reasoning traces for transparency
Fine-tunes via structured self-distillation without external resources
Builds trace-enriched dataset offline for single-pass inference
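One way to picture the single-pass property above: each filtered trace is serialized into one target sequence, so the fine-tuned model emits path, explanation, and answer in a single autoregressive generation. The tag format below is an assumption for illustration, not the paper's actual template.

```python
# Hypothetical serialization of a structured trace into one training target,
# so inference stays a single autoregressive pass (tag names are assumptions).
def format_target(relation_path, explanation, answer):
    path = " -> ".join(relation_path)
    return (f"<path>{path}</path> "
            f"<explain>{explanation}</explain> "
            f"<answer>{answer}</answer>")

def parse_answer(generated):
    """Recover the final answer from a generated sequence."""
    start = generated.index("<answer>") + len("<answer>")
    end = generated.index("</answer>")
    return generated[start:end]
```

Because the whole trace lives in one sequence, training reduces to standard next-token supervision on these targets, and no retriever or verifier is needed at inference time.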
Zhihao Wen
Singapore Management University
Graph Neural Network · Large Language Model · Parameter-Efficient Fine-Tuning · Meta-learning
Wenkang Wei
University of Science and Technology of China, China
Yuan Fang
Singapore Management University, Singapore
Xingtong Yu
Singapore Management University, Singapore
Hui Zhang
University of Science and Technology of China, China
Weicheng Zhu
Center for Data Science, New York University
Machine learning
Xin Zhang
Ant Group, China