AI Summary
The absence of a unified framework hinders research on and application of open large multimodal models (LMMs). Method: We propose xGen-MM (BLIP-3), an open-source large vision-language model family, featuring (i) the first unified training paradigm for multi-image understanding; (ii) a multi-scale Transformer fusion architecture; (iii) safety alignment via high-quality multi-stage data curation, instruction tuning, and direct preference optimization (DPO); and (iv) in-context learning enhancement strategies. Contributions/Results: The base model exhibits strong in-context learning capabilities; the instruction-tuned variant achieves state-of-the-art performance among open-source LMMs on major benchmarks; DPO fine-tuning significantly reduces hallucination and harmful outputs; and the entire stack (models, datasets, and code) is fully open-sourced, with reproducibility and generalization empirically validated across multiple benchmarks.
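The summary mentions direct preference optimization (DPO) as the safety-alignment step. As a point of reference, a minimal sketch of the standard per-pair DPO loss is shown below; the function name and the choice of `beta = 0.1` are illustrative assumptions, not taken from the paper.

```python
import math

def dpo_loss(logp_chosen_policy: float, logp_rejected_policy: float,
             logp_chosen_ref: float, logp_rejected_ref: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for a single preference pair (illustrative sketch).

    The policy is trained to widen the log-likelihood margin of the chosen
    response over the rejected one, measured relative to a frozen reference
    model; beta controls how far the policy may drift from the reference.
    """
    margin = (logp_chosen_policy - logp_chosen_ref) - \
             (logp_rejected_policy - logp_rejected_ref)
    # -log(sigmoid(beta * margin)): small when the policy prefers "chosen"
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))
```

With a zero margin the loss equals log 2; as the policy assigns more relative likelihood to the chosen response, the loss decreases toward zero.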
Abstract
This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single- and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs of similar size. In addition, we introduce a safety-tuned model trained with DPO, aiming to mitigate harmful behaviors such as hallucinations and to improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page above.