🤖 AI Summary
This work addresses a limitation of existing biomedical retrieval models: they rely on coarse-grained binary relevance signals and therefore struggle to capture fine-grained semantic overlap and hierarchical relationships among texts. To overcome this, the study applies hierarchical multi-label contrastive learning to biomedical retrieval, using the tree-structured MeSH ontology to construct structured supervision signals. The authors propose BioHiCL, a lightweight generative retrieval model released in two sizes, BioHiCL-Base (0.1B) and BioHiCL-Large (0.3B), that encodes both semantic and hierarchical dependencies among labels. Evaluated on biomedical retrieval, sentence similarity, and question answering tasks, BioHiCL achieves promising performance while remaining computationally efficient at inference.
📝 Abstract
Effective biomedical information retrieval requires modeling domain semantics and hierarchical relationships among biomedical texts. Existing biomedical generative retrievers build on coarse binary relevance signals, limiting their ability to capture semantic overlap. We propose BioHiCL (Biomedical Retrieval with Hierarchical Multi-Label Contrastive Learning), which leverages hierarchical MeSH annotations to provide structured supervision for multi-label contrastive learning. Our models, BioHiCL-Base (0.1B) and BioHiCL-Large (0.3B), achieve promising performance on biomedical retrieval, sentence similarity, and question answering tasks, while remaining computationally efficient for deployment.
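The idea of replacing binary relevance with hierarchy-aware supervision can be illustrated with a minimal sketch. The abstract does not give the loss formulation, so everything below is an assumption made for illustration: the function names (`prefix_overlap`, `label_similarity`, `soft_contrastive_loss`), the shared-prefix similarity over MeSH-style tree numbers, the best-match averaging across label sets, the representation of documents as plain embedding vectors, and the temperature value are all hypothetical and are not taken from BioHiCL.

```python
import math


def prefix_overlap(a: str, b: str) -> float:
    """Shared-prefix depth of two tree numbers, divided by the deeper depth.

    Assumption: labels are dot-separated tree-number-style strings,
    e.g. "C04.588.180"; the strings used below are illustrative only.
    """
    pa, pb = a.split("."), b.split(".")
    shared = 0
    for x, y in zip(pa, pb):
        if x != y:
            break
        shared += 1
    return shared / max(len(pa), len(pb))


def label_similarity(labels_a, labels_b) -> float:
    """Symmetric best-match average between two label sets (a value in [0, 1])."""
    if not labels_a or not labels_b:
        return 0.0
    a_to_b = sum(max(prefix_overlap(a, b) for b in labels_b) for a in labels_a)
    b_to_a = sum(max(prefix_overlap(a, b) for a in labels_a) for b in labels_b)
    return 0.5 * (a_to_b / len(labels_a) + b_to_a / len(labels_b))


def cosine(u, v) -> float:
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    return dot / (norm_u * norm_v)


def soft_contrastive_loss(embeddings, label_sets, temperature=0.1) -> float:
    """Cross-entropy between softmax(cosine / T) and label-similarity targets.

    Each anchor's targets are its label similarities to the other items,
    normalized to sum to 1. Anchors with no label overlap are skipped.
    """
    n = len(embeddings)
    total, counted = 0.0, 0
    for i in range(n):
        others = [j for j in range(n) if j != i]
        weights = [label_similarity(label_sets[i], label_sets[j]) for j in others]
        weight_sum = sum(weights)
        if weight_sum == 0:
            continue
        logits = [cosine(embeddings[i], embeddings[j]) / temperature for j in others]
        top = max(logits)  # subtract the max for a stable log-sum-exp
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        total -= sum(w / weight_sum * (l - log_z) for w, l in zip(weights, logits))
        counted += 1
    return total / counted if counted else 0.0


# Toy batch: items 0 and 1 are close in embedding space.
embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
aligned = [["C04.588.180"], ["C04.588.322"], ["C14.280.647"]]
misaligned = [["C04.588.180"], ["C14.280.647"], ["C04.588.322"]]

loss_aligned = soft_contrastive_loss(embeddings, aligned)
loss_misaligned = soft_contrastive_loss(embeddings, misaligned)
```

In this toy batch, the loss is lower when items that are close in embedding space also share a deeper label prefix (`aligned`) than when the labels are swapped (`misaligned`). Unlike a binary relevant/irrelevant target, the graded targets let partially overlapping label sets contribute partial credit.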