Visually Interpretable Subtask Reasoning for Visual Question Answering

📅 2025-05-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches that use multimodal large language models (MLLMs) to decompose complex visual question answering (VQA) tasks, e.g., "Which red furniture can be used for sitting?", into sequential subtasks suffer from high computational overhead, poor adaptation to target data, and limited interpretability. To address this, we propose VISTAR, a framework in which a single MLLM jointly generates structured textual descriptions and visual subtask reasoning chains (Subtask-of-Thought), supporting interpretable, multi-step reasoning over object recognition, attribute filtering, and relational understanding. The approach comprises subtask-driven MLLM fine-tuning, multimodal reasoning chain modeling, and unified text-visual explanation generation, requiring no external models. Evaluated on two major benchmarks, VISTAR significantly improves reasoning accuracy while providing step-wise visual grounding and semantic explanations. The code and dataset will be publicly released.

📝 Abstract
Answering complex visual questions like "Which red furniture can be used for sitting?" requires multi-step reasoning, including object recognition, attribute filtering, and relational understanding. Recent work improves interpretability in multimodal large language models (MLLMs) by decomposing tasks into sub-task programs, but these methods are computationally expensive and less accurate due to poor adaptation to target data. To address this, we introduce VISTAR (Visually Interpretable Subtask-Aware Reasoning Model), a subtask-driven training framework that enhances both interpretability and reasoning by generating textual and visual explanations within MLLMs. Instead of relying on external models, VISTAR fine-tunes MLLMs to produce structured Subtask-of-Thought rationales (step-by-step reasoning sequences). Experiments on two benchmarks show that VISTAR consistently improves reasoning accuracy while maintaining interpretability. Our code and dataset will be available at https://github.com/ChengJade/VISTAR.
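To make the idea of a Subtask-of-Thought rationale concrete, here is a minimal sketch of how the abstract's running example might decompose into the three reasoning steps it names (object recognition, attribute filtering, relational understanding). All class, function, and operation names below are illustrative assumptions, not taken from the VISTAR codebase.

```python
# Hypothetical illustration of a Subtask-of-Thought rationale for the
# question "Which red furniture can be used for sitting?". The three
# steps mirror the abstract: object recognition, attribute filtering,
# and relational understanding. Names are illustrative only.
from dataclasses import dataclass, field


@dataclass
class Subtask:
    step: int       # position in the reasoning chain
    operation: str  # e.g. "recognize", "filter_attribute", "relate"
    argument: str   # what the operation acts on
    # objects grounded in the image after this step (empty in this sketch)
    result: list = field(default_factory=list)


def decompose(question: str) -> list:
    """Toy decomposition of the paper's example question.

    In VISTAR the fine-tuned MLLM itself emits such a chain; here it is
    hard-coded purely to show the shape of a rationale.
    """
    return [
        Subtask(1, "recognize", "furniture"),
        Subtask(2, "filter_attribute", "red"),
        Subtask(3, "relate", "can be used for sitting"),
    ]


chain = decompose("Which red furniture can be used for sitting?")
for sub in chain:
    print(f"step {sub.step}: {sub.operation}({sub.argument!r})")
```

In the actual framework each step would also carry visual grounding (e.g. regions for the recognized objects), which is what makes the rationale both textually and visually interpretable.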
Problem

Research questions and friction points this paper is trying to address.

Enhancing interpretability in visual question answering
Improving reasoning accuracy in multimodal language models
Generating visual and textual explanations for subtasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Subtask-driven training framework for MLLMs
Generates textual and visual explanations internally
Fine-tunes MLLMs for Subtask-of-Thought rationales
Yu Cheng, University of Edinburgh
A. Goel, NVIDIA
Hakan Bilen, University of Edinburgh
Computer Vision · Machine Learning