🤖 AI Summary
Open-source, domain-specific large language models (LLMs) for ophthalmology have been scarce. To help address this gap, we introduce LEME (Language Enhanced Model for Eye), an open-source ophthalmology LLM built on Llama2 70B and fine-tuned on ~127,000 non-copyrighted training instances curated from ophthalmic case reports, abstracts, and open-source study materials. Benchmarked against eight other LLMs, including GPT-4, Meditron 70B, and the ophthalmology-specific EYE-Llama, LEME led most internal and external evaluations: Rouge-L of 0.20 ± 0.03 in abstract completion, 0.82 ± 0.04 in fill-in-the-blank, and 0.19 ± 0.01 in long-form QA, with expert ratings of 4.24–4.83 out of 5 for correctness, completeness, and readability in EHR summarization and clinical QA. Its reproducible fine-tuning approach and reliance on non-copyrighted data advance both the clinical relevance and the accessibility of AI in ophthalmology.
📝 Abstract
Large Language Models (LLMs) are poised to revolutionize healthcare, yet ophthalmology-specific LLMs remain scarce and underexplored. We introduce an open-source, specialized LLM for ophthalmology, termed Language Enhanced Model for Eye (LEME). LEME was built on the Llama2 70B framework and further fine-tuned with a corpus of ~127,000 non-copyrighted training instances curated from ophthalmology-specific case reports, abstracts, and open-source study materials. We benchmarked LEME against eight other LLMs: GPT-3.5, GPT-4, three Llama2 models (7B, 13B, 70B), PMC-LLAMA 13B, Meditron 70B, and EYE-Llama (another ophthalmology-specific LLM). Evaluations included four internal validation tasks: abstract completion, fill-in-the-blank, multiple-choice questions (MCQ), and short-answer QA. External validation tasks encompassed long-form QA, MCQ, patient EHR summarization, and clinical QA. Evaluation metrics included Rouge-L scores, accuracy, and expert evaluation of correctness, completeness, and readability. In internal validations, LEME consistently outperformed its counterparts, achieving Rouge-L scores of 0.20 ± 0.03 in abstract completion (all p<0.05), 0.82 ± 0.04 in fill-in-the-blank (all p<0.0001), and 0.22 ± 0.05 in short-answer QA (all p<0.0001, except versus GPT-4). In external validations, LEME excelled in long-form QA with a Rouge-L of 0.19 ± 0.01 (all p<0.0001), ranked second in MCQ accuracy (0.68 ± 0.09; all p<0.0001), and scored highest in EHR summarization and clinical QA (ranging from 4.24 to 4.83 out of 5 for correctness, completeness, and readability). LEME's emphasis on robust fine-tuning and its use of non-copyrighted data mark a notable advance among open-source ophthalmology-specific LLMs, offering the potential to transform the execution of clinical tasks while democratizing research collaboration.
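Several of the reported scores use Rouge-L, which measures overlap between a model's output and a reference text via their longest common subsequence (LCS) of tokens. As an illustration only (the paper does not specify its scoring implementation; libraries such as `rouge-score` are commonly used), a minimal F1-style Rouge-L can be sketched as:

```python
def lcs_len(a: list, b: list) -> int:
    """Length of the longest common subsequence via dynamic programming."""
    dp = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, x in enumerate(a, 1):
        for j, y in enumerate(b, 1):
            # Extend the LCS on a token match; otherwise carry the best so far.
            dp[i][j] = dp[i - 1][j - 1] + 1 if x == y else max(dp[i - 1][j], dp[i][j - 1])
    return dp[len(a)][len(b)]

def rouge_l_f1(candidate: str, reference: str) -> float:
    """F1 of LCS-based precision and recall over whitespace tokens."""
    cand, ref = candidate.split(), reference.split()
    lcs = lcs_len(cand, ref)
    if lcs == 0:
        return 0.0
    precision, recall = lcs / len(cand), lcs / len(ref)
    return 2 * precision * recall / (precision + recall)
```

A perfect match yields 1.0, disjoint texts yield 0.0, so scores like 0.20 ± 0.03 reflect partial subsequence overlap with the reference completions.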