Fast or Slow? Integrating Fast Intuition and Deliberate Thinking for Enhancing Visual Question Answering

📅 2025-06-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
MLLMs underperform on complex VQA reasoning tasks, largely because existing visual-prompting methods blindly detect and annotate every visual object, introducing redundant visual tokens and noise. To address this, the paper proposes FOCUS, a dynamic adaptation framework grounded in dual-process cognitive theory: it is the first to explicitly model "fast thinking" (zero-shot intuitive reasoning) and "slow thinking" (deliberate conceptual analysis) for VQA, selecting visual prompts according to question complexity. FOCUS introduces three key mechanisms: (i) conceptualization before observation, (ii) plug-and-play visual attention gating, and (iii) multi-granularity visual token refinement. Evaluated on four major benchmarks (ScienceQA, TextQA, VizWiz, and MME), FOCUS significantly improves the performance of both open-source and black-box MLLMs. Ablation studies confirm that combining multiple cognitive strategies with precise refinement of visual information is critical to these gains.

📝 Abstract
Multimodal large language models (MLLMs) still struggle with complex reasoning tasks in Visual Question Answering (VQA). While current methods have advanced by incorporating visual prompts, our study uncovers critical limitations: these approaches indiscriminately annotate all detected objects for every visual question, generating excessive visual markers that degrade task performance. This issue stems primarily from a lack of focus on key visual elements, raising two important questions: Are all objects equally important, and do all questions require visual prompts? Motivated by Dual Process Theory, which distinguishes between instinctive and deliberate cognitive modes in human reasoning, we propose FOCUS, a plug-and-play approach that dynamically adapts to the complexity of questions, combining fast intuitive judgments with deliberate analytical reasoning to enhance the vision-language reasoning capability of the MLLM. For straightforward questions, FOCUS supports efficient zero-shot reasoning. For more complex tasks, it employs the conceptualizing before observation strategy to highlight critical elements. Extensive experiments on four benchmarks, ScienceQA, TextQA, VizWiz, and MME, demonstrate that FOCUS consistently improves the performance of both open-source and black-box MLLMs, achieving significant gains across all datasets. Ablation studies further validate the importance of combining diverse cognitive strategies with refined visual information for superior performance. Code will be released.
Problem

Research questions and friction points this paper is trying to address.

Enhancing VQA by integrating fast intuition and deliberate thinking
Reducing excessive visual markers to improve task performance
Adapting to question complexity for better vision-language reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic adaptation to question complexity
Combines fast intuition and deliberate reasoning
Conceptualizing before observation strategy
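
The paper does not publish its implementation, but the routing idea behind the bullets above can be sketched as follows. Everything here is an illustrative stand-in: `estimate_complexity` is a toy word-count proxy (the real FOCUS presumably uses the MLLM itself to judge complexity), and the fast/slow paths are placeholders for zero-shot answering versus conceptualize-then-annotate reasoning.

```python
# Hypothetical sketch of FOCUS-style fast/slow routing for VQA.
# All functions below are illustrative assumptions, not the paper's API.

def estimate_complexity(question: str) -> float:
    """Toy proxy: longer, multi-clause questions count as more complex."""
    words = question.split()
    clauses = question.count(",") + question.count(" and ")
    return min(1.0, len(words) / 20 + 0.2 * clauses)

def fast_answer(question: str) -> str:
    # Fast path: direct zero-shot reasoning, no visual prompts added.
    return f"zero-shot answer to: {question}"

def slow_answer(question: str) -> str:
    # Slow path (conceptualize before observation): first name the key
    # objects the question needs, then answer with only those objects
    # visually highlighted instead of annotating everything.
    concepts = f"key objects for: {question}"
    return f"deliberate answer using [{concepts}]"

def focus_route(question: str, threshold: float = 0.5) -> str:
    """Route a question through the fast or slow path by complexity."""
    if estimate_complexity(question) < threshold:
        return fast_answer(question)
    return slow_answer(question)

print(focus_route("What color is the car?"))
```

The point of the sketch is the gate itself: simple questions skip visual prompting entirely, which is where the paper's efficiency and noise-reduction claims come from.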