AI Summary
In implicit-knowledge knowledge-based visual question answering (IK-KVQA), multimodal large language models (MLLMs) produce inconsistent explanations and generalize poorly because they lack explicit reasoning supervision.
Method: We propose a dual-path structured supervision framework that jointly leverages symbolic relation paths and natural-language explanations to construct traceable, interpretable reasoning trajectories. The approach combines offline trajectory construction, quality-aware filtering, and structured self-distillation fine-tuning, enabling end-to-end transparent reasoning without external knowledge bases or retrievers. Crucially, inference requires only a single autoregressive generation pass.
Contribution/Results: The method significantly enhances model interpretability and robustness. On benchmarks including OK-VQA, it achieves up to 11.3% absolute accuracy improvement over the strongest baseline, while simultaneously improving cross-domain generalization and answer credibility.
Abstract
Knowledge-based Visual Question Answering (KVQA) requires models to ground entities in images and reason over factual knowledge. We study its implicit-knowledge variant, IK-KVQA, where a multimodal large language model (MLLM) is the sole knowledge source and no external retrieval is used. However, MLLMs lack explicit reasoning supervision, produce inconsistent justifications, and generalize poorly after standard supervised fine-tuning (SFT). We present StaR-KVQA (Structured Reasoning Traces for IK-KVQA), which supervises structured traces (dual symbolic relation paths plus path-grounded natural-language explanations) so that reasoning becomes transparent and verifiable. Using one open-source MLLM, StaR-KVQA constructs and selects path-grounded reasoning traces to form a trace-enriched dataset, then fine-tunes via structured self-distillation to align generation with supervision; no external retrievers, verifiers, or curated knowledge bases (KBs) are used, traces are built offline, and inference is a single autoregressive pass. Across benchmarks, StaR-KVQA improves both accuracy and interpretability, achieving up to +11.3% higher answer accuracy on OK-VQA than the strongest baseline while exhibiting robust cross-domain generalization.
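To make the offline pipeline concrete, the sketch below illustrates one plausible form of the quality-aware filtering step: candidate traces (a symbolic relation path plus a path-grounded explanation and answer) are kept only if the answer matches the gold label and every path element is actually mentioned in the explanation. All names (`Trace`, `keep_trace`, `build_trace_dataset`) and the specific filtering criteria are illustrative assumptions, not the paper's implementation.

```python
from dataclasses import dataclass

@dataclass
class Trace:
    relation_path: list[str]  # symbolic path, e.g. ["obama", "born in", "hawaii"] (hypothetical)
    explanation: str          # natural-language explanation meant to be grounded in the path
    answer: str               # final answer produced along with the trace

def keep_trace(trace: Trace, gold_answer: str) -> bool:
    """Illustrative quality-aware filter: keep a candidate trace only if
    (a) its answer matches the gold label, and
    (b) every path element appears in the explanation,
    so the explanation stays grounded in the symbolic path."""
    answer_ok = trace.answer.strip().lower() == gold_answer.strip().lower()
    grounded = all(e.replace("_", " ").lower() in trace.explanation.lower()
                   for e in trace.relation_path)
    return answer_ok and grounded

def build_trace_dataset(candidates: list[Trace], gold_answer: str) -> list[Trace]:
    """Offline construction sketch: filter sampled candidate traces down to
    the high-quality subset used for self-distillation fine-tuning."""
    return [t for t in candidates if keep_trace(t, gold_answer)]
```

In this sketch, traces that answer incorrectly or whose explanation drifts away from the symbolic path are discarded before fine-tuning, which is one simple way to operationalize "quality-aware filtering" without any external verifier.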