Instruction-tuned Self-Questioning Framework for Multimodal Reasoning

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Visual-language models face two key bottlenecks in multi-step visual question answering (VQA): limited fine-grained visual perception and opaque, non-reproducible reasoning due to reliance on black-box large language models (LLMs). To address these, we propose an instruction-tuned self-asking framework comprising three coordinated modules—Questioner, Answerer, and Reasoner—built upon InstructBLIP. This framework iteratively generates perception-oriented sub-questions grounded in image content and fuses their answers, without accessing LLM internal parameters. By explicitly decomposing the reasoning path into interpretable, reproducible steps, it enhances both transparency and fidelity. Experiments demonstrate substantial improvements over state-of-the-art methods across multiple multi-step VQA benchmarks. Results validate that generative sub-questions serve as effective structured intermediate supervision signals, significantly boosting reasoning performance and robustness.

📝 Abstract
The field of vision-language understanding has been actively researched in recent years, thanks to the development of Large Language Models (LLMs). However, it still struggles with problems requiring multi-step reasoning, even for very simple questions. Recent studies adopt LLMs to tackle this problem by iteratively generating sub-questions and answers. However, this approach has disadvantages: 1) the fine-grained visual content of images is not available to LLMs, which cannot read visual information, and 2) the internal mechanisms of black-box LLMs are inaccessible and difficult to reproduce. To solve these problems, we propose SQ (Self-Questioning)-InstructBLIP, which improves inference performance by iteratively generating image-aware, informative sub-questions and sub-answers. SQ-InstructBLIP consists of a Questioner, an Answerer, and a Reasoner that share the same architecture. The Questioner and Answerer generate sub-questions and sub-answers that help infer the main question, and the Reasoner answers the main question by considering the generated sub-question information. Our experiments show that the proposed SQ-InstructBLIP, which uses the generated sub-questions as additional information when solving the VQA task, performs more accurate reasoning than previous works.
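The iterative loop the abstract describes can be sketched as follows. This is a minimal illustration, not the paper's implementation: the three callables are hypothetical stand-ins for the instruction-tuned InstructBLIP modules, and the toy `image` is a plain attribute dictionary rather than pixel data.

```python
def self_questioning_vqa(image, main_question, questioner, answerer, reasoner, n_rounds=3):
    """Sketch of the Questioner/Answerer/Reasoner loop.

    Each round, the Questioner proposes a perception-oriented sub-question
    conditioned on the image, the main question, and the sub-QA history;
    the Answerer answers it from the image; finally, the Reasoner answers
    the main question using the accumulated sub-QA pairs as context.
    """
    context = []  # list of (sub_question, sub_answer) pairs
    for _ in range(n_rounds):
        sub_q = questioner(image, main_question, context)
        sub_a = answerer(image, sub_q)
        context.append((sub_q, sub_a))
    return reasoner(image, main_question, context)


# --- Toy stand-ins (a real system would use instruction-tuned VLM modules) ---

def toy_questioner(image, main_question, context):
    # Cycle through a few perception attributes as sub-questions.
    attr = ["color", "shape", "count"][len(context) % 3]
    return f"What is the {attr} of the object?"

def toy_answerer(image, sub_question):
    # Look up whichever attribute the sub-question mentions.
    for attr, value in image.items():
        if attr in sub_question:
            return value
    return "unknown"

def toy_reasoner(image, main_question, context):
    facts = "; ".join(f"{q} -> {a}" for q, a in context)
    return f"Answer to '{main_question}' given sub-QA: [{facts}]"
```

For example, with `image = {"color": "red", "shape": "round"}` and two rounds, the Reasoner receives both the color and shape sub-answers as structured intermediate context before answering the main question.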
Problem

Research questions and friction points this paper is trying to address.

Addresses limitations in multimodal reasoning for vision-language understanding
Avoids reliance on black-box LLMs that cannot access visual content
Improves multi-step reasoning accuracy for visual question answering
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-questioning framework for multimodal reasoning
Iterative sub-question generation using image-aware models
Questioner, Answerer, and Reasoner components built on a shared architecture