xGen-MM (BLIP-3): A Family of Open Large Multimodal Models

Date: 2024-08-16
Venue: arXiv.org
Citations: 96
Influential citations: 12
AI Summary
The absence of a unified framework hinders research and application of open large multimodal models (LMMs). Method: We propose xGen-MM (BLIP-3), an open-source large vision-language model family, featuring (i) the first unified training paradigm for multi-image understanding; (ii) a multi-scale Transformer fusion architecture; (iii) safety alignment via high-quality multi-stage data curation, instruction tuning, and direct preference optimization (DPO); and (iv) context learning enhancement strategies. Contributions/Results: The base model exhibits strong in-context learning capabilities; the instruction-tuned variant achieves state-of-the-art performance among open-source LMMs on major benchmarks; DPO fine-tuning significantly reduces hallucination and harmful outputs; and the entire stack (models, datasets, and code) is fully open-sourced, with reproducibility and generalization empirically validated across multiple benchmarks.

๐Ÿ“ Abstract
This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.
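The abstract mentions safety tuning with DPO but gives no formula. As a rough illustration only, the following is a minimal sketch of the standard per-example DPO objective, assuming summed log-probabilities of a preferred ("chosen") and a dispreferred ("rejected") response under the trained policy and a frozen reference model. The function names, the argument names, and the value `beta=0.1` are hypothetical choices for this example and are not taken from the xGen-MM codebase or report.

```python
import math


def softplus(z: float) -> float:
    """Numerically stable log(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one preference pair.

    loss = -log sigmoid(beta * ((pi_c - ref_c) - (pi_r - ref_r)))

    All inputs are sequence log-probabilities. beta is an assumed
    illustrative value, not a setting reported for xGen-MM.
    """
    chosen_shift = policy_chosen_logp - ref_chosen_logp
    rejected_shift = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_shift - rejected_shift)
    # -log(sigmoid(margin)) equals softplus(-margin)
    return softplus(-margin)


if __name__ == "__main__":
    # Policy identical to the reference: the margin is zero.
    print(dpo_loss(-1.0, -1.0, -1.0, -1.0))
    # Policy favors the chosen response more than the reference does.
    print(dpo_loss(-1.0, -5.0, -2.0, -2.0))
```

When the policy and the reference assign the same log-probabilities, the margin is zero and the loss equals log 2. The loss falls as the policy raises the chosen response relative to the rejected one, measured against the reference.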
Problem

Research questions and friction points this paper is trying to address.

Develops open framework for large multimodal models
Evaluates models on single and multi-image tasks
Releases datasets and models to support research
Innovation

Methods, ideas, or system contributions that make the work stand out.

Open framework for Large Multimodal Models
Includes datasets, training recipe, and architectures
Competitive performance in image-text tasks
Authors

Le Xue
Salesforce AI Research

Manli Shu
Google DeepMind
Multimodal models, Large language models

Anas Awadalla
Stanford University
ML, Vision

Jun Wang
Salesforce AI Research

An Yan
Salesforce AI Research

Senthil Purushwalkam
OpenAI
Computer Vision, Multimodal

Honglu Zhou
Salesforce AI Research
Video Understanding, Multimodal and Generative AI, Machine Reasoning

Viraj Prabhu
Research Scientist, Salesforce AI Research
Computer Vision, Machine Learning, Natural Language Processing

Yutong Dai
Salesforce; Lehigh University
Multimodal Language Model, Federated Learning, Sparse Optimization

Michael S Ryoo
Salesforce AI Research

Shrikant B. Kendre
Salesforce AI Research

Jieyu Zhang
University of Washington
Data-Centric AI, Agentic AI, Multimodal Models, Machine Learning, Computer Vision

Can Qin
Salesforce
Computer Vision, Machine Learning, Deep Learning

Shu Zhang
Salesforce AI Research

Chia-Chih Chen
Salesforce AI Research

Ning Yu
Salesforce AI Research

Juntao Tan
Research Scientist, Salesforce
Machine Learning, Explainable AI, Recommendation System, Information Retrieval

Tulika Awalgaonkar
Salesforce
LLM/LMMs, Quantization, LLM Agents

Shelby Heinecke
Salesforce Research
Artificial Intelligence, AI Agents, LLM Agents, Multi-Agent Systems, Recommendation Systems

Huan Wang
Salesforce AI Research

Yejin Choi
Stanford University / NVIDIA
Natural Language Processing, Deep Learning, Artificial Intelligence, Commonsense Reasoning

Ludwig Schmidt
Stanford University and Anthropic
Machine Learning, Artificial Intelligence, Optimization, Algorithms, Statistics

Zeyuan Chen
Salesforce AI Research

Silvio Savarese
Associate Professor of Computer Science at Stanford University
Computer vision

Juan Carlos Niebles
Research Director (Salesforce) & Adjunct Professor (Stanford University)
Action Recognition, Video Understanding, Video Analysis, Computer Vision

Caiming Xiong
Salesforce Research
Machine Learning, NLP, Computer Vision, Multimedia, Data Mining

Ran Xu
Salesforce AI Research