Parrot: Multilingual Visual Instruction Tuning

📅 2024-06-04
🏛️ arXiv.org
📈 Citations: 10
✨ Influential: 0
📄 PDF
🤖 AI Summary
Multimodal large language models (MLLMs) suffer from degraded non-English visual instruction understanding due to English-dominant supervised fine-tuning data. Method: We propose a text-guided, language-level visual token alignment framework. Its core components include: (1) the first language-conditioned visual token generation mechanism; (2) a Mixture-of-Experts (MoE)-based architecture for multilingual visual representation alignment; and (3) the MMMB benchmark, the first large-scale multilingual multimodal evaluation suite covering 6 languages, 15 task categories, and 12K samples. Our approach integrates cross-modal cross-attention, MoE-driven multilingual joint fine-tuning, and synthetic data alignment. Results: Our method achieves state-of-the-art performance on both the multilingual subset of MMBench and MMMB. It significantly improves general multimodal capabilities across languages. Code and training data are fully open-sourced.

πŸ“ Abstract
The rapid development of Multimodal Large Language Models (MLLMs) like GPT-4V has marked a significant step towards artificial general intelligence. Existing methods mainly focus on aligning vision encoders with LLMs through supervised fine-tuning (SFT) to endow LLMs with multimodal abilities, making MLLMs' inherent ability to react to multiple languages progressively deteriorate as the training process evolves. We empirically find that the imbalanced SFT datasets, primarily composed of English-centric image-text pairs, lead to significantly reduced performance in non-English languages. This is due to the failure of aligning the vision encoder and LLM with multilingual tokens during the SFT process. In this paper, we introduce Parrot, a novel method that utilizes textual guidance to drive visual token alignment at the language level. Parrot makes the visual tokens condition on diverse language inputs and uses Mixture-of-Experts (MoE) to promote the alignment of multilingual tokens. Specifically, to enhance non-English visual tokens alignment, we compute the cross-attention using the initial visual features and textual embeddings, the result of which is then fed into the MoE router to select the most relevant experts. The selected experts subsequently convert the initial visual tokens into language-specific visual tokens. Moreover, considering the current lack of benchmarks for evaluating multilingual capabilities within the field, we collect and make available a Massive Multilingual Multimodal Benchmark which includes 6 languages, 15 categories, and 12,000 questions, named as MMMB. Our method not only demonstrates state-of-the-art performance on multilingual MMBench and MMMB, but also excels across a broad range of multimodal tasks. Both the source code and the training dataset of Parrot will be made publicly available. Code is available at: https://github.com/AIDC-AI/Parrot.
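The alignment pipeline the abstract describes (cross-attention between initial visual features and textual embeddings, an MoE router that selects the most relevant expert, and that expert converting the visual tokens into language-specific ones) can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' released implementation: the class name `ParrotAlignSketch`, the top-1 (single-expert) routing, the mean-pooled router input, and all matrix shapes are hypothetical choices for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class ParrotAlignSketch:
    """Hypothetical sketch of text-guided MoE visual token alignment:
    cross-attention(visual, text) -> MoE router -> expert transforms visual tokens."""

    def __init__(self, dim, num_experts, seed=0):
        rng = np.random.default_rng(seed)
        scale = 1.0 / np.sqrt(dim)
        # Cross-attention projections (visual tokens query the text embeddings).
        self.Wq = rng.standard_normal((dim, dim)) * scale
        self.Wk = rng.standard_normal((dim, dim)) * scale
        self.Wv = rng.standard_normal((dim, dim)) * scale
        # Router logits over experts, and one linear "expert" per language group.
        self.router = rng.standard_normal((dim, num_experts)) * scale
        self.experts = [rng.standard_normal((dim, dim)) * scale
                        for _ in range(num_experts)]

    def forward(self, visual, text):
        # Cross-attention: each visual token attends over the text embeddings,
        # yielding a language-aware guidance vector per visual token.
        q, k, v = visual @ self.Wq, text @ self.Wk, text @ self.Wv
        attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))
        guided = attn @ v
        # Route on the pooled guided features; top-1 expert selection
        # (the paper's router may select multiple experts).
        gate = softmax(guided.mean(axis=0) @ self.router)
        expert = int(np.argmax(gate))
        # The selected expert converts the ORIGINAL visual tokens
        # into language-specific visual tokens.
        return visual @ self.experts[expert], expert
```

As a usage sketch, feeding 16 visual tokens and 8 text-embedding tokens of width 32 returns a language-specific token matrix of the same shape as the visual input, plus the index of the chosen expert.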
Problem

Research questions and friction points this paper is trying to address.

Addresses multilingual token alignment degradation in MLLMs
Proposes Parrot for language-specific visual token alignment
Introduces MMMB benchmark for multilingual multimodal evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Leverages textual guidance for visual token alignment
Uses Mixture-of-Experts for multilingual token alignment
Introduces Massive Multilingual Multimodal Benchmark
Hai-Long Sun
LAMDA Group, Nanjing University
Machine Learning · MLLM · Continual Learning

Da-Wei Zhou
Associate Researcher, Nanjing University
Incremental Learning · Continual Learning · Open-World Learning · Model Reuse

Yang Li
AI Business, Alibaba Group

Shiyin Lu
Alibaba Group
Multimodal Large Language Models · Online Learning · Bandits

Chao Yi
School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University

Qingguo Chen
AI Business, Alibaba Group

Zhao Xu
AI Business, Alibaba Group

Weihua Luo
Alibaba
natural language processing · machine learning · artificial intelligence

Kaifu Zhang
Assistant Professor of Marketing, Carnegie Mellon University
Two-sided markets · Internet platforms · e-commerce

De-Chuan Zhan
School of Artificial Intelligence, Nanjing University; National Key Laboratory for Novel Software Technology, Nanjing University

Han-Jia Ye
Nanjing University
Machine Learning · Data Mining · Metric Learning · Meta-Learning