🤖 AI Summary
General-purpose multimodal large language models (MLLMs) face challenges in domain-specific adaptation—particularly in scientific and industrial verticals—due to scarce domain data, high annotation costs, and cumbersome multi-stage fine-tuning pipelines. Method: We propose a domain-adaptive post-training framework featuring: (1) an open-model-driven generate-then-filter data synthesis paradigm, replacing manual rule-based curation and proprietary vision-language models (e.g., GPT-4V); (2) a single-stage, end-to-end visual instruction tuning pipeline that eliminates the conventional two-stage (pretraining + supervised fine-tuning) approach, enhancing task diversity and adaptation efficiency; and (3) a cross-domain evaluation benchmark spanning biomedical, food, and remote sensing domains. Results: Experiments demonstrate substantial performance gains for MLLMs on domain-specific tasks. All models, code, and synthetic datasets are publicly released to advance research and deployment of domain-specialized multimodal AI.
📝 Abstract
Adapting general multimodal large language models (MLLMs) to specific domains, such as scientific and industrial fields, is essential for their practical application. This paper systematically investigates domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation. (1) Data Synthesis: Using only open-source models, we develop a generate-then-filter pipeline that curates diverse visual instruction tasks from domain-specific image-caption pairs. The resulting data surpass data synthesized by manual rules or strong closed-source models (e.g., GPT-4V) in enhancing domain-specific performance. (2) Training Pipeline: While two-stage training (first on image-caption pairs, then on visual instruction tasks) is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training. (3) Task Evaluation: We conduct extensive experiments in high-impact domains such as biomedicine, food, and remote sensing by post-training a variety of MLLMs and evaluating their performance on diverse domain-specific tasks. Furthermore, we fully open-source our models, code, and data to encourage future research in this area.
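To make the generate-then-filter idea concrete, here is a minimal sketch of such a synthesis loop. All names (`InstructionTask`, `generate_tasks`, `filter_tasks`, `synthesize`) are hypothetical stand-ins: in the paper's actual pipeline, generation and filtering would be performed by prompted open-source MLLMs rather than the toy heuristics used here.

```python
# Hedged sketch of a generate-then-filter data synthesis loop over
# domain-specific image-caption pairs. The model calls are mocked with
# simple heuristics; a real pipeline would invoke an open-source MLLM.
from dataclasses import dataclass


@dataclass
class InstructionTask:
    """One synthesized visual instruction example (hypothetical schema)."""
    image_id: str
    instruction: str
    response: str


def generate_tasks(image_id: str, caption: str) -> list[InstructionTask]:
    """Generate step: stand-in for prompting an open-source MLLM to
    propose diverse instruction tasks from an image-caption pair."""
    return [
        InstructionTask(image_id, "Describe the image in detail.", caption),
        InstructionTask(
            image_id,
            "What is the main subject shown?",
            caption.split()[0] if caption else "",
        ),
    ]


def filter_tasks(tasks: list[InstructionTask]) -> list[InstructionTask]:
    """Filter step: stand-in for a model-based consistency check.
    Here approximated by dropping tasks with empty responses."""
    return [t for t in tasks if t.response.strip()]


def synthesize(pairs: list[tuple[str, str]]) -> list[InstructionTask]:
    """Run generate-then-filter over a corpus of image-caption pairs."""
    data: list[InstructionTask] = []
    for image_id, caption in pairs:
        data.extend(filter_tasks(generate_tasks(image_id, caption)))
    return data
```

In a full implementation, the filter would typically score each generated task for faithfulness to the source caption and image before keeping it, which is what lets open models replace manual rules or proprietary curators.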