What Makes for Good Visual Instructions? Synthesizing Complex Visual Reasoning Instructions for Visual Instruction Tuning

📅 2023-11-02
🏛️ International Conference on Computational Linguistics
📈 Citations: 14
Influential: 1
🤖 AI Summary
This work addresses the challenge of enhancing zero-shot generalization in multimodal large language models (MLLMs) through principled visual instruction design. The authors propose a three-stage automated instruction construction paradigm, "synthesize-complicate-reformulate", and empirically establish a positive correlation between instruction complexity and model performance. They introduce ComVint, presented as the first high-quality instruction-tuning dataset tailored for complex visual reasoning, comprising 32K samples. The method integrates vision-language joint modeling with LLM-based instruction reformulation and quality assurance. On the MME-Perception and MME-Cognition benchmarks, the approach improves LLaVA's performance by 27.86% and 27.60%, respectively, and delivers consistent gains across four mainstream MLLMs. The code and ComVint dataset are publicly released.
📝 Abstract
Visual instruction tuning is crucial for enhancing the zero-shot generalization capability of Multi-modal Large Language Models (MLLMs). In this paper, we aim to investigate a fundamental question: "what makes for good visual instructions". Through a comprehensive empirical study, we find that instructions focusing on complex visual reasoning tasks are particularly effective in improving the performance of MLLMs, with results correlating to instruction complexity. Based on this insight, we develop a systematic approach to automatically create high-quality complex visual reasoning instructions. Our approach employs a synthesize-complicate-reformulate paradigm, leveraging multiple stages to gradually increase the complexity of the instructions while guaranteeing quality. Based on this approach, we create the ComVint dataset with 32K examples, and fine-tune four MLLMs on it. Experimental results consistently demonstrate the enhanced performance of all compared MLLMs, such as a 27.86% and 27.60% improvement for LLaVA on MME-Perception and MME-Cognition, respectively. Our code and data are publicly available at the link: https://github.com/RUCAIBox/ComVint.
Problem

Research questions and friction points this paper is trying to address.

Complex visual reasoning instructions
Multi-modal Large Language Models
Zero-shot generalization capability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Synthesize-complicate-reformulate paradigm
Automates complex visual instructions
Enhances MLLM zero-shot generalization
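The three-stage paradigm above can be sketched as a simple pipeline. This is a minimal toy illustration, not the paper's implementation: the stage functions here are hypothetical string templates standing in for the LLM calls the paper uses at each step, and the annotation fields (`object`, `attribute`) are assumed for the example.

```python
# Hypothetical sketch of a synthesize-complicate-reformulate pipeline.
# In the paper, each stage is performed by an LLM; here each stage is a
# placeholder string transformation so the control flow is visible.

def synthesize(annotation: dict) -> str:
    # Stage 1: draft an initial instruction from image annotations.
    return f"What is the {annotation['attribute']} of the {annotation['object']}?"

def complicate(instruction: str, context: str) -> str:
    # Stage 2: increase reasoning complexity by chaining in extra visual context.
    return f"Considering that {context}, {instruction[0].lower() + instruction[1:]}"

def reformulate(instruction: str) -> str:
    # Stage 3: rewrite for fluency and check quality (here: a trivial polish).
    return instruction.strip().rstrip("?") + "?"

def build_instruction(annotation: dict, context: str) -> str:
    # Compose the three stages into one instruction-construction step.
    return reformulate(complicate(synthesize(annotation), context))

instruction = build_instruction(
    {"object": "umbrella", "attribute": "color"},
    "the person is standing in the rain",
)
print(instruction)
# → Considering that the person is standing in the rain, what is the color of the umbrella?
```

The key design point the paper argues for is the staging itself: complexity is added incrementally (stage 2) rather than generated in one shot, with a final reformulation pass (stage 3) guarding quality.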