🤖 AI Summary
Existing vision-language models (VLMs) perform poorly on zero-shot multi-step reasoning tasks, largely because prior decomposition pipelines rely on domain-specific sub-question decomposers and force a final answer even when the available information is insufficient, undermining reliability. This paper proposes IdealGPT, a domain-agnostic, iterative decomposition framework: an LLM generates sub-questions, a VLM provides visually grounded sub-answers, and the LLM then aggregates the results and decides whether to terminate or run another round. This divide-and-conquer loop repeats until the model is confident in its final answer, combining zero-shot prompting with an adaptive, self-correcting architecture. Under zero-shot settings, IdealGPT outperforms the strongest GPT-4-based baselines by an absolute 10% on VCR and 15% on SNLI-VE.
📝 Abstract
The field of vision-and-language (VL) understanding has made unprecedented progress with end-to-end large pre-trained VL models (VLMs). However, they still fall short in zero-shot reasoning tasks that require multi-step inference. To address this, previous works resort to a divide-and-conquer pipeline. In this paper, we argue that previous efforts have several inherent shortcomings: 1) They rely on domain-specific sub-question decomposing models. 2) They force models to predict the final answer even if the sub-questions or sub-answers provide insufficient information. We address these limitations via IdealGPT, a framework that iteratively decomposes VL reasoning using large language models (LLMs). Specifically, IdealGPT utilizes an LLM to generate sub-questions, a VLM to provide corresponding sub-answers, and another LLM to reason over the sub-answers to reach the final answer. These three modules perform the divide-and-conquer procedure iteratively until the model is confident about the final answer to the main question. We evaluate IdealGPT on multiple challenging VL reasoning tasks under a zero-shot setting. In particular, our IdealGPT outperforms the best existing GPT-4-like models by an absolute 10% on VCR and 15% on SNLI-VE. Code is available at https://github.com/Hxyou/IdealGPT
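The iterative three-module loop described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `generate_subquestions`, `answer_with_vlm`, and `reason_to_answer` are hypothetical stand-ins for the LLM and VLM calls, and the confidence check is a toy criterion.

```python
# Minimal sketch of IdealGPT's iterative divide-and-conquer loop.
# All three helpers below are hypothetical stubs standing in for
# real LLM/VLM prompting; they are NOT the paper's actual API.

def generate_subquestions(question, history):
    # LLM proposes sub-questions, conditioned on prior rounds.
    return [f"sub-question {len(history) + 1} for: {question}"]

def answer_with_vlm(image, subquestions):
    # VLM grounds each sub-question in the image.
    return [(q, f"answer to '{q}'") for q in subquestions]

def reason_to_answer(question, history):
    # LLM aggregates sub-QA pairs and returns (answer, confident?).
    confident = len(history) >= 2  # toy confidence criterion
    return "final answer", confident

def ideal_gpt_loop(image, question, max_rounds=4):
    """Decompose, ground, reason; repeat until confident or out of rounds."""
    history = []
    answer = None
    for _ in range(max_rounds):
        subs = generate_subquestions(question, history)
        history.extend(answer_with_vlm(image, subs))
        answer, confident = reason_to_answer(question, history)
        if confident:
            break  # stop early instead of forcing an answer each round
    return answer, history

answer, trace = ideal_gpt_loop("image.jpg", "Is the person about to cross?")
```

The key design point mirrored here is the termination check: rather than forcing a prediction after one round of decomposition, the reasoner may request further sub-question rounds until it judges the evidence sufficient.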