🤖 AI Summary
General-purpose multimodal large language models (MLLMs) face challenges in domain-specific adaptation—particularly in scientific and industrial verticals—due to scarce domain data, high annotation costs, and cumbersome multi-stage fine-tuning pipelines. Method: We propose a domain-adaptive post-training framework featuring: (1) an open-model-driven generate-then-filter data synthesis paradigm, replacing manual rule-based curation and proprietary vision-language models (e.g., GPT-4V); (2) a single-stage, end-to-end visual instruction tuning pipeline that eliminates the conventional two-stage (pretraining + supervised fine-tuning) approach, enhancing task diversity and adaptation efficiency; and (3) a cross-domain evaluation benchmark spanning biomedical, food, and remote sensing domains. Results: Experiments demonstrate substantial performance gains for MLLMs on domain-specific tasks. All models, code, and synthetic datasets are publicly released to advance research and deployment of domain-specialized multimodal AI.
📝 Abstract
Adapting general multimodal large language models (MLLMs) to specific domains, such as scientific and industrial fields, is essential for their practical application. This paper systematically investigates domain adaptation of MLLMs through post-training, focusing on data synthesis, training pipelines, and task evaluation. (1) Data Synthesis: Using only open-source models, we develop a generate-then-filter pipeline that curates diverse visual instruction tasks from domain-specific image-caption pairs. The resulting data surpass data synthesized by manual rules or strong closed-source models (e.g., GPT-4V) in enhancing domain-specific performance. (2) Training Pipeline: While two-stage training (first on image-caption pairs, then on visual instruction tasks) is commonly adopted for developing general MLLMs, we apply a single-stage training pipeline to enhance task diversity for domain-specific post-training. (3) Task Evaluation: We conduct extensive experiments in high-impact domains such as biomedicine, food, and remote sensing by post-training a variety of MLLMs and evaluating their performance on diverse domain-specific tasks. Furthermore, we fully open-source our models, code, and data to encourage future research in this area.
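To make the generate-then-filter idea concrete, here is a minimal sketch of such a synthesis loop. All names (`InstructionTask`, `generate_tasks`, `filter_tasks`, `synthesize`) are hypothetical stand-ins: in the paper's actual pipeline, generation and filtering would be performed by prompted open-source MLLMs rather than the toy heuristics used here.

```python
# Hedged sketch of a generate-then-filter data synthesis loop over
# domain-specific image-caption pairs. The model calls are mocked with
# simple heuristics; a real pipeline would invoke an open-source MLLM.
from dataclasses import dataclass


@dataclass
class InstructionTask:
    """One synthesized visual instruction example (hypothetical schema)."""
    image_id: str
    instruction: str
    response: str


def generate_tasks(image_id: str, caption: str) -> list[InstructionTask]:
    """Generate step: stand-in for prompting an open-source MLLM to
    propose diverse instruction tasks from an image-caption pair."""
    return [
        InstructionTask(image_id, "Describe the image in detail.", caption),
        InstructionTask(
            image_id,
            "What is the main subject shown?",
            caption.split()[0] if caption else "",
        ),
    ]


def filter_tasks(tasks: list[InstructionTask]) -> list[InstructionTask]:
    """Filter step: stand-in for a model-based consistency check.
    Here approximated by dropping tasks with empty responses."""
    return [t for t in tasks if t.response.strip()]


def synthesize(pairs: list[tuple[str, str]]) -> list[InstructionTask]:
    """Run generate-then-filter over a corpus of image-caption pairs."""
    data: list[InstructionTask] = []
    for image_id, caption in pairs:
        data.extend(filter_tasks(generate_tasks(image_id, caption)))
    return data
```

In a full implementation, the filter would typically score each generated task for faithfulness to the source caption and image before keeping it, which is what lets open models replace manual rules or proprietary curators.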