🤖 AI Summary
Medical text embedding models face three major bottlenecks: narrow training data coverage, outdated methodologies, and insufficient evaluation benchmarks—severely limiting their generalizability in real-world clinical settings. To address these challenges, we propose a systematic solution: (1) We introduce MedEval, the first fine-grained, medical-domain-specific benchmark covering 51 diverse tasks to comprehensively evaluate semantic understanding, retrieval, classification, and clustering capabilities; (2) Leveraging heterogeneous, multi-source medical corpora, we perform domain-adaptive fine-tuning of the GTE architecture via self-supervised contrastive learning. Experiments demonstrate that our model achieves significant gains over state-of-the-art methods on MedEval, with improvements of 12.7%–23.4% on critical tasks—including clinical term similarity estimation, electronic health record retrieval, and cross-modal alignment—establishing a reproducible, multidimensional evaluation paradigm for medical embeddings.
📝 Abstract
Medical text embedding models are foundational to a wide array of healthcare applications, ranging from clinical decision support and biomedical information retrieval to medical question answering, yet they remain hampered by two critical shortcomings. First, most models are trained on a narrow slice of medical and biological data and rely on outdated methodology, making them ill-suited to capture the diversity of terminology and semantics encountered in practice. Second, existing evaluations are often inadequate: even widely used benchmarks fail to generalize across the full spectrum of real-world medical tasks.
To address these gaps, we present MEDTE, a GTE-based model extensively fine-tuned on diverse medical corpora through self-supervised contrastive learning across multiple data sources, delivering robust medical text embeddings.
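The abstract does not spell out the training objective, but self-supervised contrastive fine-tuning of this kind is commonly driven by an in-batch InfoNCE-style loss, where each text is pulled toward its paired positive and pushed away from the other examples in the batch. A minimal NumPy sketch (all names and the temperature value are illustrative, not taken from the paper):

```python
import numpy as np

def info_nce_loss(query_emb, pos_emb, temperature=0.05):
    """In-batch InfoNCE: row i of pos_emb is the positive for row i of
    query_emb; every other row in the batch serves as a negative."""
    # L2-normalize so dot products become cosine similarities
    q = query_emb / np.linalg.norm(query_emb, axis=1, keepdims=True)
    p = pos_emb / np.linalg.norm(pos_emb, axis=1, keepdims=True)
    logits = (q @ p.T) / temperature              # (B, B) similarity matrix
    # softmax cross-entropy with the diagonal as the correct class
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_probs)))
```

When query and positive embeddings are well aligned, the diagonal dominates each row of the similarity matrix and the loss approaches zero; mismatched pairs drive it up, which is the gradient signal a contrastive fine-tuning run would exploit.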
Alongside this model, we propose a comprehensive benchmark suite of 51 tasks spanning classification, clustering, pair classification, and retrieval, modeled on the Massive Text Embedding Benchmark (MTEB) but tailored to the nuances of medical text. Our results demonstrate that this combined approach not only establishes a robust evaluation framework but also yields embeddings that consistently outperform state-of-the-art alternatives across diverse tasks.