On Domain-Specific Post-Training for Multimodal Large Language Models

📅 2024-11-29
🏛️ arXiv.org
📈 Citations: 1
Influential: 0
🤖 AI Summary
General-purpose multimodal large language models (MLLMs) face challenges in domain-specific adaptation—particularly in scientific and industrial verticals—due to scarce domain data, high annotation costs, and cumbersome multi-stage fine-tuning pipelines. Method: We propose a domain-adaptive post-training framework featuring: (1) an open-model-driven generate-then-filter data synthesis paradigm, replacing manual rule-based curation and proprietary vision-language models (e.g., GPT-4V); (2) a single-stage, end-to-end visual instruction tuning pipeline that eliminates the conventional two-stage (pretraining + supervised fine-tuning) approach, enhancing task diversity and adaptation efficiency; and (3) a cross-domain evaluation benchmark spanning biomedical, food, and remote sensing domains. Results: Experiments demonstrate substantial performance gains for MLLMs on domain-specific tasks. All models, code, and synthetic datasets are publicly released to advance research and deployment of domain-specialized multimodal AI.

📝 Abstract
Adapting general multimodal large language models (MLLMs) to specific domains, such as scientific and industrial fields, is highly significant in promoting their practical applications. This paper systematically investigates domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation. (1) Data Synthesis: Using only open-source models, we develop a generate-then-filter pipeline that curates diverse visual instruction tasks based on domain-specific image-caption pairs. The resulting data surpass the data synthesized by manual rules or strong closed-source models (e.g., GPT-4V) in enhancing domain-specific performance. (2) Training Pipeline: While the two-stage training (initially on image-caption pairs, followed by visual instruction tasks) is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training. (3) Task Evaluation: We conduct extensive experiments in high-impact domains such as biomedicine, food, and remote sensing, by post-training a variety of MLLMs and then evaluating MLLM performance on various domain-specific tasks. Furthermore, we fully open-source our models, code, and data to encourage future research in this area.
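The generate-then-filter idea from the abstract can be sketched as a two-step loop: an open model proposes candidate instruction tasks from an image-caption pair, and a filter discards candidates whose answers are not supported by the caption. The sketch below is illustrative only; the generator is stubbed, and the word-overlap heuristic stands in for the paper's model-based filtering, which is not specified here.

```python
# Minimal generate-then-filter sketch. `generate_candidates` and the
# overlap-based `is_consistent` check are assumptions for illustration,
# not the paper's actual implementation.

def generate_candidates(caption: str) -> list[dict]:
    """Stand-in for an open-source MLLM that proposes visual
    instruction tasks conditioned on an image-caption pair."""
    return [
        {"question": "What organ is shown?", "answer": "a human heart"},
        {"question": "What color is the sky?", "answer": "bright purple"},
    ]

def is_consistent(caption: str, answer: str, threshold: float = 0.5) -> bool:
    """Crude consistency filter: keep answers whose content words
    mostly appear in the caption. A real pipeline would use a
    model-based judge instead of token overlap."""
    caption_words = set(caption.lower().split())
    answer_words = [w for w in answer.lower().split() if len(w) > 2]
    if not answer_words:
        return False
    hits = sum(w in caption_words for w in answer_words)
    return hits / len(answer_words) >= threshold

def synthesize(caption: str) -> list[dict]:
    """Generate candidate tasks, then keep only caption-supported ones."""
    return [t for t in generate_candidates(caption)
            if is_consistent(caption, t["answer"])]

caption = "an mri scan of a human heart with visible ventricles"
kept = synthesize(caption)
print(kept)  # the unsupported "bright purple" candidate is filtered out
```

The key design point is that filtering requires no proprietary model: any open checkpoint strong enough to judge caption-answer consistency can replace the heuristic.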
Problem

Research questions and friction points this paper is trying to address.

Adapting MLLMs to specific domains for practical applications
Developing a data synthesis pipeline for domain-specific visual tasks
Evaluating MLLM performance in high-impact domains post-training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Developed generate-then-filter data synthesis pipeline
Applied single-stage training for domain-specific tasks
Conducted extensive domain-specific task evaluations
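The single-stage pipeline named above replaces the conventional two sequential stages (caption pretraining, then instruction tuning) with one mixed training stream. A minimal sketch of that data-mixing step, with illustrative placeholder datasets and an assumed 1:1 mixing ratio:

```python
# Sketch of single-stage post-training data mixing: both task types
# are shuffled into one stream rather than trained in two stages.
# Dataset contents and the mixing ratio are illustrative assumptions.
import random

caption_data = [{"task": "caption", "text": f"caption {i}"} for i in range(6)]
instruction_data = [{"task": "instruction", "text": f"instr {i}"} for i in range(6)]

def single_stage_mix(captions, instructions, seed=0):
    """Interleave caption and instruction examples into a single
    shuffled training stream, raising per-batch task diversity."""
    stream = captions + instructions
    random.Random(seed).shuffle(stream)
    return stream

stream = single_stage_mix(caption_data, instruction_data)
print(len(stream), {ex["task"] for ex in stream})
```

Compared with the two-stage recipe, every batch drawn from this stream can contain both task types, which is the diversity effect the paper attributes to single-stage training.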
Daixuan Cheng
Gaoling School of AI, Renmin University of China
LLM Pre-Training, Domain Adaptation, Reasoning

Shaohan Huang
Microsoft Research Asia

Ziyu Zhu
State Key Laboratory of General Artificial Intelligence, BIGAI, Tsinghua University

Xintong Zhang
State Key Laboratory of General Artificial Intelligence, BIGAI, Beijing Institute of Technology

Wayne Xin Zhao
Professor, Renmin University of China
Recommender System, Natural Language Processing, Large Language Model

Zhongzhi Luan
Beihang University

Bo Dai
State Key Laboratory of General Artificial Intelligence, BIGAI

Zhenliang Zhang
State Key Laboratory of General Artificial Intelligence, BIGAI