🤖 AI Summary
To address hallucination, high training costs, and poor generalization in multimodal large language models (MLLMs) for complex visual reasoning, this paper proposes the Socratic Questioning (SQ) framework. SQ employs a multi-round self-questioning mechanism to guide lightweight MLLMs toward salient visual cues, introducing the first heuristic self-guided reasoning paradigm that integrates chain-of-thought (CoT) prompting with visual instruction fine-tuning. The authors construct CapQA, the first benchmark dataset for fine-grained activity understanding, and design a novel hallucination quantification metric. Experiments demonstrate that SQ reduces hallucination scores by 31.2%, significantly improves zero-shot performance on complex visual reasoning and fine-grained description tasks, and achieves state-of-the-art results across multiple benchmarks.
📝 Abstract
Complex visual reasoning remains a key challenge today. Typically, the challenge is tackled using methodologies such as Chain of Thought (CoT) prompting and visual instruction tuning. However, how to organically combine these two methodologies for greater success remains unexplored. Also, issues like hallucination and high training costs still need to be addressed. In this work, we devise an innovative multi-round training and reasoning framework suitable for lightweight Multimodal Large Language Models (MLLMs). Our self-questioning approach heuristically guides MLLMs to focus on visual clues relevant to the target problem, reducing hallucinations and enhancing the model's ability to describe fine-grained image details. This ultimately enables the model to perform well in complex visual reasoning and question-answering tasks. We name this framework Socratic Questioning (SQ). To facilitate future research, we create a multimodal mini-dataset named CapQA, which includes 1k images of fine-grained activities, for visual instruction tuning and evaluation. On CapQA, our proposed SQ method leads to a 31.2% improvement in the hallucination score. Our extensive experiments on various benchmarks demonstrate SQ's remarkable capabilities in heuristic self-questioning, zero-shot visual reasoning, and hallucination mitigation. Our model and code will be publicly available.