GPT4RoI: Instruction Tuning Large Language Model on Region-of-Interest

📅 2023-07-07

🏛️ arXiv.org

📈 Citations: 204

✨ Influential: 12

🤖 AI Summary

Existing vision-language models mainly support image-level understanding, which limits fine-grained vision-language interaction. To address this, we propose spatial instruction tuning, a framework in which instructions refer to regions of interest (RoIs). Each region reference in the instruction is replaced with the corresponding RoI visual features, which are interleaved with the language embeddings and modeled as a single sequence. The model accepts textual instructions combined with user-drawn bounding boxes, enabling region-level grounding, perception of multiple attributes (e.g., color, shape, material, action), and commonsense reasoning across regions. Trained on seven region-text pair datasets, GPT4RoI achieves 81.6% accuracy on the Visual Commonsense Reasoning (VCR) benchmark, 6.0 percentage points above the previous best of 75.6% and close to human performance (85.0%).

📝 Abstract

Visual instruction tuning of large language models (LLMs) on image-text pairs has achieved general-purpose vision-language abilities. However, the lack of region-text pairs limits progress toward fine-grained multimodal understanding. In this paper, we propose spatial instruction tuning, which introduces references to regions of interest (RoIs) in the instruction. Before being sent to the LLM, each reference is replaced by RoI features and interleaved with the language embeddings as a sequence. Our model GPT4RoI, trained on 7 region-text pair datasets, brings an unprecedented interactive and conversational experience compared to previous image-level models. (1) Interaction beyond language: Users can interact with our model through both language and drawn bounding boxes to flexibly adjust the referring granularity. (2) Versatile multimodal abilities: GPT4RoI can mine a variety of attribute information within each RoI, e.g., color, shape, material, and action. Furthermore, it can reason about multiple RoIs based on common sense. On the Visual Commonsense Reasoning (VCR) dataset, GPT4RoI achieves an accuracy of 81.6%, surpassing all existing models by a significant margin (the second place is 75.6%) and almost reaching the human-level performance of 85.0%. The code and model can be found at https://github.com/jshilong/GPT4RoI.
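The replace-and-interleave step described in the abstract can be illustrated with a small sketch. This is a toy illustration under stated assumptions, not the GPT4RoI implementation: the `<regionN>` placeholder syntax, the `toy_text_embedding` function, and the 4-dimensional vectors are all hypothetical, and the way RoI features are extracted from the image is outside the sketch.

```python
import re
from typing import Dict, List, Sequence

# Hypothetical placeholder syntax (an assumption): the source only says the
# instruction contains a "reference" to each region of interest.
REGION_PATTERN = re.compile(r"<region(\d+)>")


def toy_text_embedding(token: str, dim: int) -> List[float]:
    """Stand-in for an LLM embedding lookup: a deterministic toy vector."""
    seed = sum(ord(ch) for ch in token)
    return [((seed * (i + 1)) % 97) / 97.0 for i in range(dim)]


def interleave(instruction: str,
               roi_features: Dict[int, Sequence[float]],
               dim: int) -> List[List[float]]:
    """Replace each region reference with its RoI feature vector and keep
    every other token as a language embedding, preserving token order."""
    sequence: List[List[float]] = []
    for token in instruction.split():
        match = REGION_PATTERN.fullmatch(token)
        if match:
            feature = roi_features[int(match.group(1))]
            if len(feature) != dim:
                raise ValueError("RoI feature must match the embedding width")
            sequence.append(list(feature))
        else:
            sequence.append(toy_text_embedding(token, dim))
    return sequence


# Two regions, each assumed to be already pooled to a 4-dimensional feature.
rois = {1: [0.1, 0.2, 0.3, 0.4], 2: [0.5, 0.6, 0.7, 0.8]}
seq = interleave("What is <region1> handing to <region2> ?", rois, dim=4)
```

The resulting `seq` has one vector per whitespace-separated token, with the RoI features sitting at the positions where the region references appeared. A real system would use the LLM's tokenizer and embedding table and would project the RoI features to the LLM's hidden size.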

Problem

Research questions and friction points this paper is trying to address.

Image-level vision-language models lack fine-grained, region-level understanding.

Region-text pairs are scarce, which limits progress on fine-grained multimodal understanding.

Language-only instructions give users no direct way to refer to a specific region or adjust the referring granularity.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Spatial instruction tuning with RoI features

Interactive model via language and bounding boxes

Commonsense reasoning over multiple RoIs, reaching 81.6% accuracy on VCR
