OphIn-500K: Curating Web-Scale Visual Instructions for Scaling Ophthalmic Multimodal Large Language Models

πŸ“… 2026-05-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
This study addresses the scarcity of large-scale, high-quality multimodal instruction-tuning data in ophthalmology, which limits the clinical applicability of specialized multimodal large language models (MLLMs). To overcome this challenge, the authors propose OphIn-Engine, a novel data engine that automatically extracts image–text pairs from openly available ophthalmic videos. By integrating visual cue disentanglement, quality assessment, and instruction synthesis, the engine constructs OphIn-500Kβ€”a dataset comprising 500,000 instructions aligned with 151,000 images. An MLLM trained on this dataset, named OphIn-VL, demonstrates significant performance gains over existing general-purpose and domain-specific medical MLLMs across visual question answering, multi-turn dialogue, and chain-of-thought reasoning tasks, thereby substantially enhancing clinical comprehension capabilities.
πŸ“ Abstract
The advancement of general medical Multimodal Large Language Models (MLLMs) has shown great potential for building conversational assistants to support clinical diagnosis. However, their adaptation to highly specialized domains such as ophthalmology remains underexplored, primarily due to the scarcity of large-scale, domain-specific instruction-tuning data. Existing ophthalmic datasets for conversational agents are often limited in scale and largely rely on images from established public benchmarks, limiting the scalability of ophthalmic MLLMs and their ability to capture real-world clinical complexity. To address this gap, we propose $\textbf{OphIn-Engine}$, an ophthalmology-specific instruction data curation pipeline that constructs high-quality instruction data from open-access ophthalmology web-scale videos. The pipeline integrates multimodal transcription for extracting image-transcript pairs, visual cue separation and scoring for identifying clinically relevant visual descriptions, and instruction synthesis with quality control for generating accurate and diverse clinical dialogues. Using this engine, we introduce $\textbf{OphIn-500K}$, a large-scale multimodal ophthalmology instruction-tuning dataset containing over 500,000 instruction instances and more than 151,000 unique images from over 29,000 video clips, formatted as visual question answering (VQA), multi-turn conversational interactions, and chain-of-thought (CoT) reasoning. Built upon this dataset, we further develop $\textbf{OphIn-VL}$, an ophthalmology-specific MLLM with advanced visual understanding and conversational capabilities. Comprehensive experiments and case studies demonstrate that OphIn-VL achieves superior performance compared with state-of-the-art general medical and domain-specific MLLMs.
Problem

Research questions and friction points this paper is trying to address.

ophthalmology
multimodal large language models
instruction-tuning data
clinical complexity
data scarcity
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction data curation
multimodal large language model
ophthalmology
visual cue separation
chain-of-thought reasoning
Xuanzhao Dong
Xuanzhao Dong
School of Computing and Augmented Intelligence, Arizona State University
machine learningdeep learninggenerative model
Wenhui Zhu
Wenhui Zhu
Arizona State University
Computer VisionArtificial intelligenceVision Language ModelLarge Language Model
Xiwen Chen
Xiwen Chen
Clemson University
Deep LearningMultimodalComputer VisionTime Series AnalysisVLM/LLM
Hao Wang
Hao Wang
Clemson University
AI Art
Xin Li
Xin Li
Arizona State University
Computer GraphicsComputer Vision
Yujian Xiong
Yujian Xiong
PhD Student, Arizona State University
geometric deep learningneuroimagingbrain imagingcomputer vision
J
Jiajun Cheng
Arizona State University
Jingjing Wang
Jingjing Wang
Professor, School of Cyber Science and Technology, Beihang University
AI for WirelessUAV NetworksSpace-Air-Ground-Sea NetworksCommunication Security
Xiaobing Yu
Xiaobing Yu
PhD student in Imaging Science in Washington University in St. Louis
Machine LearningComputational BiologyDeep LearningMedical Image Analysis
Haiyu Wu
Haiyu Wu
Research Scientist, Altos Labs
Computer VisionMulti-modal generative model
Shao Tang
Shao Tang
Linkedin
LLM Post-TrainingAgentOptimization
Zhipeng Wang
Zhipeng Wang
LinkedIn; ex-Google, Apple, Amazon. Rice University, PhD.
Efficient MLML SystemsMultimodal LLMRLBioinformatics
L
Langechuan Liu
NVIDIA
Shan Lin
Shan Lin
Arizona State University
robotic perceptionAImedical robotics
Oana Dumitrascu
Oana Dumitrascu
Associate Professor of Neurology, Mayo Clinic
NeurologyStrokeNeurodegenerationNeuro-ophthalmology
Yalin Wang
Yalin Wang
Professor of Computer Science and Engineering, Arizona State University
Brain ImagingComputer VisionMachine LearningStatistical Pattern Recognition