🤖 AI Summary
Existing vision-language models (VLMs) perform poorly on zero-shot multi-step reasoning tasks, largely because prior decomposition pipelines rely on domain-specific sub-question decomposers and force a final answer even when the available information is insufficient, undermining reliability. This paper proposes IdealGPT, a domain-agnostic, iterative decomposition framework: an LLM generates sub-questions, a VLM provides visually grounded sub-answers, and the LLM then aggregates the results and decides whether to terminate or run another round. This divide-and-conquer loop repeats until the model is confident in its final answer, combining zero-shot prompting with an adaptive, self-correcting architecture. Under zero-shot settings, IdealGPT outperforms the strongest GPT-4-based baselines by an absolute 10% on VCR and 15% on SNLI-VE.
📝 Abstract
The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inference. To address this, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason over the sub-answers to reach the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT
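The iterative three-module loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_subquestions`, `answer_with_vlm`, and `reason_to_answer` are hypothetical stand-ins for the LLM and VLM calls, and the confidence check is a toy criterion.

```python
# Minimal sketch of IdealGPT's iterative divide-and-conquer loop.
# All three helpers below are hypothetical stubs standing in for
# real LLM/VLM prompting; they are NOT the paper's actual API.

def generate_subquestions(question, history):
    # LLM proposes sub-questions, conditioned on prior rounds.
    return [f"sub-question {len(history) + 1} for: {question}"]

def answer_with_vlm(image, subquestions):
    # VLM grounds each sub-question in the image.
    return [(q, f"answer to '{q}'") for q in subquestions]

def reason_to_answer(question, history):
    # LLM aggregates sub-QA pairs and returns (answer, confident?).
    confident = len(history) >= 2  # toy confidence criterion
    return "final answer", confident

def ideal_gpt_loop(image, question, max_rounds=4):
    """Decompose, ground, reason; repeat until confident or out of rounds."""
    history = []
    answer = None
    for _ in range(max_rounds):
        subs = generate_subquestions(question, history)
        history.extend(answer_with_vlm(image, subs))
        answer, confident = reason_to_answer(question, history)
        if confident:
            break  # stop early instead of forcing an answer each round
    return answer, history

answer, trace = ideal_gpt_loop("image.jpg", "Is the person about to cross?")
```

The key design point mirrored here is the termination check: rather than forcing a prediction after one round of decomposition, the reasoner may request further sub-question rounds until it judges the evidence sufficient.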