OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

career value

200K/year

🤖 AI Summary

This study addresses the scarcity of large-scale, high-quality multimodal instruction-tuning data in ophthalmology, which limits the clinical applicability of specialized multimodal large language models (MLLMs). To overcome this challenge, the authors propose OphIn-Engine, a novel data engine that automatically extracts image–text pairs from openly available ophthalmic videos. By integrating visual cue disentanglement, quality assessment, and instruction synthesis, the engine constructs OphIn-500K—a dataset comprising 500,000 instructions aligned with 151,000 images. An MLLM trained on this dataset, named OphIn-VL, demonstrates significant performance gains over existing general-purpose and domain-specific medical MLLMs across visual question answering, multi-turn dialogue, and chain-of-thought reasoning tasks, thereby substantially enhancing clinical comprehension capabilities.

📝 Abstract

The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose $\textbf{OphIn-Engine}$, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce $\textbf{OphIn-500K}$, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop $\textbf{OphIn-VL}$, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.

Problem

Research questions and friction points this paper is trying to address.

ophthalmology

multimodal large language models

instruction-tuning data

clinical complexity

data scarcity

Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction data curation

multimodal large language model

ophthalmology