🤖 AI Summary
Current vision-language models (VLMs) lack systematic logical reasoning capabilities for complex visual question answering (VQA), particularly in information integration, stepwise deduction, and conclusion generation.
Method: We propose a multi-stage autonomous reasoning paradigm—comprising summarization, visual understanding, logical deduction, and conclusion generation—and construct LLaVA-CoT-100k, the first large-scale VQA dataset with structured chain-of-thought (CoT) annotations. We further design a stage-level beam search algorithm to enhance reasoning efficiency and fidelity, and perform end-to-end fine-tuning on the LLaVA architecture using multi-source VQA data with fine-grained CoT supervision.
Contribution/Results: Our approach achieves a 7.4% absolute improvement over strong baselines on multimodal reasoning benchmarks, outperforming larger or closed-source models including Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
📝 Abstract
Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a novel VLM designed to conduct autonomous multi-stage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. In addition, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-CoT not only outperforms its base model by 7.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
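The stage-level beam search described above can be illustrated with a minimal sketch: at each reasoning stage, sample several candidate outputs, keep the best one, and condition the next stage on it. This is an assumption-based sketch, not the authors' implementation; `generate_candidates` and `pick_better` are hypothetical stand-ins for the fine-tuned VLM's sampler and the candidate-comparison step.

```python
# Sketch of stage-level beam search over the four LLaVA-CoT reasoning stages.
# All model calls are stubbed out; a real implementation would invoke the
# fine-tuned LLaVA-CoT model for both generation and comparison.

STAGES = ["summary", "caption", "reasoning", "conclusion"]

def generate_candidates(context: str, stage: str, n: int) -> list[str]:
    """Hypothetical stand-in for sampling n candidate outputs for one stage."""
    return [f"<{stage}> candidate {i}" for i in range(n)]

def pick_better(a: str, b: str) -> str:
    """Hypothetical stand-in for the candidate-comparison step; here we just
    pick the lexicographically smaller string so the sketch runs end to end."""
    return min(a, b)

def stage_level_beam_search(question: str, n_candidates: int = 4) -> list[str]:
    """Run the four stages in order, keeping the best candidate per stage."""
    context = question
    kept = []
    for stage in STAGES:
        candidates = generate_candidates(context, stage, n_candidates)
        best = candidates[0]
        for other in candidates[1:]:
            best = pick_better(best, other)  # pairwise tournament selection
        kept.append(best)
        context += "\n" + best  # later stages are conditioned on earlier ones
    return kept

trace = stage_level_beam_search("What fraction of the circle is shaded?")
```

The key design point is that the search operates at stage granularity rather than token granularity: each of the four stages is generated and filtered as a unit, which is what makes inference-time scaling with a small candidate budget tractable.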