AI Summary
In implicit-knowledge knowledge-based visual question answering (IK-KVQA), multimodal large language models (MLLMs) produce inconsistent explanations and generalize poorly because they lack explicit reasoning supervision.
Method: We propose a dual-path structured supervision framework that jointly leverages symbolic relation paths and natural-language explanations to construct traceable, interpretable reasoning trajectories. The approach combines offline trajectory construction, quality-aware filtering, and structured self-distillation fine-tuning, enabling end-to-end transparent reasoning without external knowledge bases or retrievers. Crucially, inference requires only a single autoregressive generation pass.
Contribution/Results: The method significantly enhances model interpretability and robustness. On benchmarks including OK-VQA, it achieves up to 11.3% absolute accuracy improvement over the strongest baseline, while simultaneously improving cross-domain generalization and answer credibility.
Abstract
Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. We study its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source and no external retrieval is used. However, MLLMs lack explicit reasoning supervision, produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning (SFT). We present StaR-KVQA (Structured Reasoning Traces for IK-KVQA), which supervises structured traces (dual symbolic relation paths plus path-grounded natural-language explanations) so that reasoning becomes transparent and verifiable. Using one open-source MLLM, StaR-KVQA constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision; no external retrievers, verifiers, or curated knowledge bases (KBs) are used, traces are built offline, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA improves both accuracy and interpretability, achieving up to +11.3% higher answer accuracy on OK-VQA than the strongest baseline while exhibiting robust cross-domain generalization.
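To make the offline pipeline concrete, the sketch below illustrates one plausible form of the quality-aware filtering step: candidate traces (a symbolic relation path plus a path-grounded explanation and answer) are kept only if the answer matches the gold label and every path element is actually mentioned in the explanation. All names (`Trace`, `keep_trace`, `build_trace_dataset`) and the specific filtering criteria are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    relation_path: list[str]  # symbolic path, e.g. ["obama", "born in", "hawaii"] (hypothetical)
    explanation: str          # natural-language explanation meant to be grounded in the path
    answer: str               # final answer produced along with the trace

def keep_trace(trace: Trace, gold_answer: str) -> bool:
    """Illustrative quality-aware filter: keep a candidate trace only if
    (a) its answer matches the gold label, and
    (b) every path element appears in the explanation,
    so the explanation stays grounded in the symbolic path."""
    answer_ok = trace.answer.strip().lower() == gold_answer.strip().lower()
    grounded = all(e.replace("_", " ").lower() in trace.explanation.lower()
                   for e in trace.relation_path)
    return answer_ok and grounded

def build_trace_dataset(candidates: list[Trace], gold_answer: str) -> list[Trace]:
    """Offline construction sketch: filter sampled candidate traces down to
    the high-quality subset used for self-distillation fine-tuning."""
    return [t for t in candidates if keep_trace(t, gold_answer)]
```

In this sketch, traces that answer incorrectly or whose explanation drifts away from the symbolic path are discarded before fine-tuning, which is one simple way to operationalize "quality-aware filtering" without any external verifier.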