Bootstrapping Grounded Chain-of-Thought in Multimodal LLMs for Data-Efficient Model Adaptation

📅 2025-07-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Multimodal large language models (MLLMs) excel at general image understanding but struggle to adapt to specialized visual tasks, such as chart and table interpretation, from only a few examples, largely because pretraining data offers sparse coverage of non-object-centric images. Method: The authors propose Grounded Chain-of-Thought (GCoT), a framework that improves image faithfulness and cross-modal alignment by injecting bounding-box-level visual grounding into the reasoning process. GCoT uses a self-bootstrapping mechanism that combines chain-of-thought reasoning, knowledge distillation, and precise localization annotations to construct high-quality, explicitly grounded reasoning data. Contribution/Results: Evaluated across five specialized visual domains, GCoT achieves significant gains in accuracy and interpretability over conventional fine-tuning and distillation baselines under extremely low-data regimes (e.g., ≤16 samples per task), establishing an efficient, lightweight paradigm for adapting MLLMs to data-scarce professional domains while preserving visual grounding fidelity.

📝 Abstract
Multimodal Large Language Models (MLLMs) have demonstrated remarkable capabilities in interpreting images using natural language. However, without using large-scale datasets for retraining, these models are difficult to adapt to specialized vision tasks, e.g., chart understanding. This problem is caused by a mismatch between pre-training and downstream datasets: pre-training datasets primarily concentrate on scenes and objects but contain limited information about specialized, non-object images, such as charts and tables. In this paper, we share an interesting finding that training an MLLM with Chain-of-Thought (CoT) reasoning data can facilitate model adaptation in specialized vision tasks, especially under data-limited regimes. However, we identify a critical issue within CoT data distilled from pre-trained MLLMs, i.e., the data often contains multiple factual errors in the reasoning steps. To address the problem, we propose Grounded Chain-of-Thought (GCoT), a simple bootstrapping-based approach that aims to inject grounding information (i.e., bounding boxes) into CoT data, essentially making the reasoning steps more faithful to input images. We evaluate our approach on five specialized vision tasks, which cover a variety of visual formats including charts, tables, receipts, and reports. The results demonstrate that under data-limited regimes our approach significantly improves upon fine-tuning and distillation.
Problem

Research questions and friction points this paper is trying to address.

Adapting MLLMs to specialized vision tasks without large datasets
Factual errors in Chain-of-Thought reasoning data from MLLMs
Improving reasoning faithfulness in multimodal tasks via grounding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Chain-of-Thought reasoning data to drive few-shot model adaptation
Injects bounding-box grounding into CoT data to correct factual errors in reasoning steps
Improves accuracy on specialized vision tasks with as few as 16 samples per task
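The core bootstrapping idea, distilling CoT data from a pretrained MLLM and keeping only reasoning steps whose cited bounding boxes actually correspond to the image, can be sketched as follows. This is a minimal illustration of the filtering principle; the function names, data layout, and IoU threshold are assumptions for the sketch, not the authors' actual pipeline:

```python
# Illustrative sketch: keep a distilled reasoning step only when the
# bounding box it cites overlaps a ground-truth annotation, so the
# resulting GCoT data stays faithful to the input image.
# All names and the 0.5 threshold are hypothetical choices.

def iou(a, b):
    """Intersection-over-union of two (x1, y1, x2, y2) boxes."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0

def filter_grounded_cot(steps, gt_boxes, thresh=0.5):
    """Filter distilled CoT steps by grounding fidelity.

    steps: list of (text, box) pairs from a pretrained MLLM, where box
    is (x1, y1, x2, y2) or None for steps making no visual claim.
    gt_boxes: ground-truth annotation boxes for the image.
    """
    kept = []
    for text, box in steps:
        if box is None or any(iou(box, gt) >= thresh for gt in gt_boxes):
            kept.append((text, box))
    return kept
```

A step citing a region far from any annotation (a likely hallucination) is dropped, while purely logical steps with no box pass through unchanged; the surviving grounded steps then form the fine-tuning data for the next bootstrapping round.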