AI Summary
The absence of a unified framework hinders research on and application of open large multimodal models (LMMs). Method: We propose xGen-MM (BLIP-3), an open-source large vision-language model family, featuring (i) the first unified training paradigm for multi-image understanding; (ii) a multi-scale Transformer fusion architecture; (iii) safety alignment via high-quality multi-stage data curation, instruction tuning, and direct preference optimization (DPO); and (iv) in-context learning enhancement strategies. Contributions/Results: The base model exhibits strong in-context learning capabilities; the instruction-tuned variant achieves state-of-the-art performance among open-source LMMs on major benchmarks; DPO fine-tuning significantly reduces hallucination and harmful outputs; and the entire stack (models, datasets, and code) is fully open-sourced, with reproducibility and generalization empirically validated across multiple benchmarks.
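The summary mentions direct preference optimization (DPO) as the safety-alignment step. As a point of reference, a minimal sketch of the standard per-pair DPO loss is shown below; the function name and the choice of `beta = 0.1` are illustrative assumptions, not taken from the paper.

```python
import math

def dpo_loss(logp_chosen_policy: float, logp_rejected_policy: float,
             logp_chosen_ref: float, logp_rejected_ref: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair (illustrative sketch).

    The policy is trained to widen the log-likelihood margin of the chosen
    response over the rejected one, measured relative to a frozen reference
    model; beta controls how far the policy may drift from the reference.
    """
    margin = (logp_chosen_policy - logp_chosen_ref) - \
             (logp_rejected_policy - logp_rejected_ref)
    # -log(sigmoid(beta * margin)): small when the policy prefers "chosen"
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a zero margin the loss equals log 2; as the policy assigns more relative likelihood to the chosen response, the loss decreases toward zero.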
Abstract
This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single- and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs of similar size. In addition, we introduce a safety-tuned model trained with DPO, aiming to mitigate harmful behaviors such as hallucinations and to improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.