MedMax: Mixed-Modal Instruction Tuning for Training Biomedical Assistants

📅 2024-12-17
🏛️ arXiv.org
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Current biomedical multimodal datasets suffer from limited scale, narrow data provenance, and restricted task coverage, hindering the development of unified biomedical assistants. To address this, we introduce MedMax, a large-scale (1.47M samples), multi-source (medical literature and clinical YouTube videos), cross-domain (radiology, histopathology, and more) mixed-modal instruction-tuning dataset. We further design a unified multimodal evaluation suite covering image-text generation, visual question answering (VQA), and report understanding. The training recipe incorporates cross-modal alignment and domain-knowledge injection. Across 12 biomedical VQA benchmarks, models fine-tuned on MedMax outperform Chameleon by 26% and GPT-4o by 18.3%. All data, models, and code are publicly released.

πŸ“ Abstract
Recent advancements in mixed-modal generative models have opened new avenues for developing unified biomedical assistants capable of analyzing biomedical images, answering complex questions about them, and generating multimodal patient reports. However, existing datasets face challenges such as small sizes, limited coverage of biomedical tasks and domains, and a reliance on narrow sources. To address these gaps, we present MedMax, a large-scale multimodal biomedical instruction-tuning dataset for mixed-modal foundation models. With 1.47 million instances, MedMax encompasses a diverse range of tasks, including interleaved image-text generation, biomedical image captioning and generation, visual chat, and report understanding. These tasks span knowledge across diverse biomedical domains, including radiology and histopathology, grounded in medical papers and YouTube videos. Subsequently, we fine-tune a mixed-modal foundation model on the MedMax dataset, achieving significant performance improvements: a 26% gain over the Chameleon model and an 18.3% improvement over GPT-4o across 12 downstream biomedical visual question-answering tasks. Finally, we introduce a unified evaluation suite for biomedical tasks to guide the development of mixed-modal biomedical AI assistants. The data, model, and code are available at https://mint-medmax.github.io/.
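The abstract describes MedMax records as interleaved image-text instruction data covering captioning, VQA, image generation, and report understanding. The released schema is not reproduced here; the Python sketch below shows one plausible way to represent such a mixed-modal example, with every class and field name being a hypothetical illustration rather than the actual MedMax format.

```python
# Minimal illustrative sketch: an interleaved image-text instruction example.
# All names (TextChunk, ImageChunk, MixedModalExample, field names) are hypothetical
# and do NOT reflect the released MedMax schema.

from dataclasses import dataclass, field
from typing import List, Literal, Union


@dataclass
class TextChunk:
    content: str
    kind: Literal["text"] = "text"


@dataclass
class ImageChunk:
    path: str                       # path or URL of the biomedical image
    kind: Literal["image"] = "image"


@dataclass
class MixedModalExample:
    task: str                       # e.g. "vqa", "captioning", "image_generation"
    domain: str                     # e.g. "radiology", "histopathology"
    prompt: List[Union[TextChunk, ImageChunk]] = field(default_factory=list)
    target: List[Union[TextChunk, ImageChunk]] = field(default_factory=list)


# A VQA-style instance: the image sits in the prompt, the answer is text.
example = MixedModalExample(
    task="vqa",
    domain="radiology",
    prompt=[ImageChunk(path="chest_xray_001.png"),
            TextChunk(content="Is there evidence of pleural effusion?")],
    target=[TextChunk(content="Yes, a small left-sided pleural effusion is visible.")],
)
```

Representing prompts and targets as ordered lists of typed chunks lets one format express text-only answers (VQA), image-only targets (image generation), and interleaved image-text reports without separate schemas.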
Problem

Research questions and friction points this paper is trying to address.

Overcoming the small size, narrow sourcing, and limited task coverage of existing biomedical datasets
Developing mixed-modal models for diverse biomedical tasks
Improving performance on biomedical visual question-answering tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Large-scale multimodal biomedical instruction-tuning dataset
Fine-tuned mixed-modal foundation model
Unified evaluation suite for biomedical tasks (illustrative sketch below)
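The paper releases its own evaluation suite, which is not shown here. As a rough, hypothetical sketch of what a unified harness over several biomedical VQA benchmarks could look like (all names such as evaluate, model_fn, and Benchmark are placeholders, not the authors' API):

```python
# Illustrative sketch only: score a model on several VQA-style benchmarks with
# exact-match accuracy and report per-task and average scores.

from typing import Callable, Dict, List, Tuple

# Each benchmark is a list of (image_path, question, gold_answer) triples.
Benchmark = List[Tuple[str, str, str]]


def evaluate(model_fn: Callable[[str, str], str],
             benchmarks: Dict[str, Benchmark]) -> Dict[str, float]:
    """Run the model on every benchmark and return exact-match accuracy per task."""
    scores: Dict[str, float] = {}
    for name, examples in benchmarks.items():
        correct = 0
        for image_path, question, gold in examples:
            prediction = model_fn(image_path, question)
            correct += int(prediction.strip().lower() == gold.strip().lower())
        scores[name] = correct / max(len(examples), 1)
    # The average is taken over the per-task scores collected above.
    scores["average"] = sum(scores.values()) / max(len(scores), 1)
    return scores


# Toy usage with a stub model that always answers "yes".
toy_suite = {"toy_vqa": [("img.png", "Is a lesion visible?", "yes")]}
print(evaluate(lambda image_path, question: "yes", toy_suite))
# {'toy_vqa': 1.0, 'average': 1.0}
```

Exact-match accuracy is only one possible metric; the paper's suite also spans generation and report-understanding tasks, which would require task-specific scoring.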