LLaVA-CoT: Let Vision Language Models Reason Step-by-Step

📅 2024-11-15
🏛️ arXiv.org
📈 Citations: 6 · Influential: 0
🤖 AI Summary
Current vision-language models (VLMs) lack systematic logical reasoning capabilities for complex visual question answering (VQA), particularly in information integration, stepwise deduction, and conclusion generation. Method: We propose a multi-stage autonomous reasoning paradigm, comprising summarization, visual interpretation, logical reasoning, and conclusion generation, and construct LLaVA-CoT-100k, a large-scale VQA dataset with structured chain-of-thought (CoT) annotations. We further design a stage-level beam search algorithm for effective inference-time scaling, and fine-tune the base model end-to-end on multi-source VQA data with structured CoT supervision. Contribution/Results: Our approach yields a 7.4% absolute improvement over the base model across multimodal reasoning benchmarks, outperforming larger and even closed-source models, including Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
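The four-stage structure can be made concrete as explicit stage tags in the model's output. Below is a minimal parsing sketch in Python; the tag names mirror the stages described above, but the helper itself is an illustrative assumption, not code from the authors' release.

```python
import re

# The four reasoning stages, in the order they are generated. The tag
# strings follow the stages named in the summary; treat them as illustrative.
STAGES = ["SUMMARY", "CAPTION", "REASONING", "CONCLUSION"]

def parse_structured_response(text: str) -> dict:
    """Extract each stage's content from a tag-delimited model response."""
    parsed = {}
    for stage in STAGES:
        match = re.search(rf"<{stage}>(.*?)</{stage}>", text, re.DOTALL)
        if match is None:
            raise ValueError(f"response is missing the <{stage}> stage")
        parsed[stage] = match.group(1).strip()
    return parsed

# Toy example of a well-formed structured response.
response = (
    "<SUMMARY>The question asks which object is heavier.</SUMMARY>"
    "<CAPTION>The image shows a bowling ball next to a balloon.</CAPTION>"
    "<REASONING>A bowling ball is a dense solid while a balloon is mostly "
    "air, so the bowling ball must weigh more.</REASONING>"
    "<CONCLUSION>The bowling ball is heavier.</CONCLUSION>"
)
print(parse_structured_response(response)["CONCLUSION"])
```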

📝 Abstract
Large language models have demonstrated substantial advancements in reasoning capabilities, particularly through inference-time scaling, as illustrated by models such as OpenAI's o1. However, current Vision-Language Models (VLMs) often struggle to perform systematic and structured reasoning, especially when handling complex visual question-answering tasks. In this work, we introduce LLaVA-CoT, a novel VLM designed to conduct autonomous multistage reasoning. Unlike chain-of-thought prompting, LLaVA-CoT independently engages in sequential stages of summarization, visual interpretation, logical reasoning, and conclusion generation. This structured approach enables LLaVA-CoT to achieve marked improvements in precision on reasoning-intensive tasks. To accomplish this, we compile the LLaVA-CoT-100k dataset, integrating samples from various visual question answering sources and providing structured reasoning annotations. In addition, we propose an inference-time stage-level beam search method, which enables effective inference-time scaling. Remarkably, with only 100k training samples and a simple yet effective inference-time scaling method, LLaVA-CoT not only outperforms its base model by 7.4% on a wide range of multimodal reasoning benchmarks, but also surpasses the performance of larger and even closed-source models, such as Gemini-1.5-pro, GPT-4o-mini, and Llama-3.2-90B-Vision-Instruct.
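The stage-level beam search can be sketched as a per-stage tournament: sample several candidates for the current stage, let the model judge pairs and keep the winner, then condition the next stage on it. In the sketch below, the `generate` and `prefer` callables stand in for model calls and are hypothetical names; this is a sketch of the idea under those assumptions, not the authors' implementation.

```python
from typing import Callable

def stage_level_beam_search(
    prompt: str,
    stages: list,
    generate: Callable[[str, str], str],     # (context, stage) -> one candidate
    prefer: Callable[[str, str, str], str],  # (context, a, b) -> the better one
    n_candidates: int = 4,
) -> str:
    """Sketch: sample candidates per reasoning stage, keep the best, move on."""
    context = prompt
    for stage in stages:
        candidates = [generate(context, stage) for _ in range(n_candidates)]
        best = candidates[0]
        for challenger in candidates[1:]:
            # A pairwise judgment (e.g., by the model itself) keeps the winner.
            best = prefer(context, best, challenger)
        context += best  # the winning stage output conditions later stages
    return context
```

Because selection happens once per stage rather than per token or per complete answer, the search cost grows with the number of stages and candidates rather than the full output length, which is what makes this form of inference-time scaling comparatively cheap.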
Problem

Research questions and friction points this paper is trying to address.

Vision-Language Models
Complex Visual Task Processing
Logical Reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Step-by-step Visual Reasoning
Large-scale VQA Dataset with Structured Reasoning Annotations
Stage-level Beam Search for Inference-time Scaling
Guowei Xu
Tsinghua University
Language Models · Reinforcement Learning
Peng Jin
School of Electronic and Computer Engineering, Peking University; Rabbitpre AI & PKU Shenzhen AIGC Joint Lab; Peng Cheng Laboratory
Hao Li
School of Electronic and Computer Engineering, Peking University; Peng Cheng Laboratory
Yibing Song
Deputy Chief Engineer, BYD Group
Multi-Modal AI
Lichao Sun
Computer Science and Engineering, Lehigh University
Li Yuan
Research Associate, University of Science & Technology of China (USTC)
Antibiotic resistance · Wastewater treatment · Environmental bioremediation · Anaerobic digestion · Fate of organic pollutants