BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the difficulty of hard negative mining and the limited performance of small models in biomedical dense retrieval, this paper proposes a citation-aware method for constructing hard negatives. It uses citation relationships among PubMed articles to automatically identify documents that are highly relevant to a source article but not duplicates of it, which makes them suitable domain-specific hard negatives. The method is described as the first to explicitly use citation structure as a signal for hard negative generation; it requires no manual annotation or auxiliary training and supports zero-shot transfer. Evaluated with GTE-small and GTE-base, it improves nDCG@10 on BEIR and Success@5 on LoTTE, reaching state-of-the-art Success@5 on LoTTE, particularly for long-tail topics, and it generalizes across tasks and domains in low-resource settings. The core contribution is a scalable, lightweight, and domain-adaptive approach to hard negative mining that supports efficient adaptation of small retrieval models to the biomedical domain.
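The idea of turning citation links into hard negatives can be illustrated with a minimal sketch. It is not the paper's pipeline: the `Article` and `Triple` types, the choice of title as query and abstract as positive, and the `max_negatives` cap are all assumptions made for illustration.

```python
from typing import Dict, List, NamedTuple


class Article(NamedTuple):
    # Hypothetical record layout; the source does not specify one.
    pmid: str
    title: str
    abstract: str
    cited_pmids: List[str]


class Triple(NamedTuple):
    query: str
    positive: str
    negatives: List[str]


def build_triples(articles: Dict[str, Article], max_negatives: int = 4) -> List[Triple]:
    """Pair each article with the abstracts of the articles it cites.

    Assumption: the title acts as the query, the article's own abstract
    is the positive, and cited abstracts are the hard negatives.
    """
    triples = []
    for article in articles.values():
        negatives = []
        for pmid in article.cited_pmids:
            cited = articles.get(pmid)
            # Skip citations outside the corpus and self-citations.
            if cited is None or cited.pmid == article.pmid:
                continue
            # Skip exact duplicates, which would be false negatives.
            if cited.abstract == article.abstract:
                continue
            negatives.append(cited.abstract)
            if len(negatives) == max_negatives:
                break
        if negatives:
            triples.append(Triple(article.title, article.abstract, negatives))
    return triples
```

Articles with no usable citations inside the corpus yield no triple here; a real pipeline would need a policy for that case, which the summary does not describe.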

📝 Abstract
Hard negatives are essential for training effective retrieval models. Hard-negative mining typically relies on ranking documents with cross-encoders or static embedding models, using similarity metrics such as cosine similarity. Mining is especially challenging in biomedical and scientific domains because source documents and hard negatives are difficult to tell apart. Referenced documents, however, naturally share contextual relevance with the source document without being duplicates, which makes them well suited as hard negatives. In this work, we propose BiCA: Biomedical Dense Retrieval with Citation-Aware Hard Negatives, an approach to hard-negative mining that uses citation links in 20,000 PubMed articles to improve a domain-specific small dense retriever. We fine-tune the GTE-small and GTE-base models with these citation-informed negatives and observe consistent improvements in zero-shot dense retrieval, measured by nDCG@10, on both in-domain and out-of-domain BEIR tasks; the fine-tuned models also outperform baselines on long-tailed topics in LoTTE, measured by Success@5. Our findings highlight the potential of leveraging document link structure to generate highly informative negatives, enabling state-of-the-art performance with minimal fine-tuning and demonstrating a path towards highly data-efficient domain adaptation.
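For readers unfamiliar with the two metrics named in the abstract, the following is a minimal sketch of how nDCG@10 and Success@5 can be computed for a single query. It assumes the linear-gain form of DCG with a log2 discount; the function names are illustrative and do not come from the paper or from any benchmark toolkit.

```python
import math
from typing import Dict, List, Set


def dcg_at_k(gains: List[float], k: int) -> float:
    # Discounted cumulative gain: rank i (0-based) is discounted by log2(i + 2).
    return sum(g / math.log2(i + 2) for i, g in enumerate(gains[:k]))


def ndcg_at_k(ranked_ids: List[str], relevance: Dict[str, float], k: int = 10) -> float:
    """nDCG@k for one query, with linear gains (an assumption of this sketch)."""
    gains = [relevance.get(doc_id, 0.0) for doc_id in ranked_ids]
    # The ideal ranking lists all judged documents by descending relevance.
    ideal = sorted(relevance.values(), reverse=True)
    ideal_dcg = dcg_at_k(ideal, k)
    return dcg_at_k(gains, k) / ideal_dcg if ideal_dcg > 0 else 0.0


def success_at_k(ranked_ids: List[str], relevant_ids: Set[str], k: int = 5) -> float:
    """1.0 if any relevant document appears in the top k results, else 0.0."""
    return 1.0 if any(d in relevant_ids for d in ranked_ids[:k]) else 0.0
```

Both functions score one query at a time; a benchmark score is normally the mean over all queries in the test set.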
Problem

Research questions and friction points this paper is trying to address.

Improving biomedical retrieval by using citation-linked documents as hard negatives
Addressing difficulty in distinguishing source documents from hard negatives
Enhancing domain adaptation with minimal fine-tuning through structured negatives
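For contrast, the conventional similarity-ranked mining that the abstract mentions can be sketched as below. This is an illustrative baseline under assumed names (`mine_by_similarity`, `doc_vecs`, `positive_id`), using plain Python lists in place of real embeddings.

```python
import math
from typing import Dict, List, Sequence


def cosine_similarity(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def mine_by_similarity(
    query_vec: Sequence[float],
    doc_vecs: Dict[str, Sequence[float]],
    positive_id: str,
    k: int = 2,
) -> List[str]:
    """Return the k documents most similar to the query, excluding the positive."""
    scored = [
        (cosine_similarity(query_vec, vec), doc_id)
        for doc_id, vec in doc_vecs.items()
        if doc_id != positive_id
    ]
    # Highest similarity first: these are the candidate hard negatives.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:k]]
```

When many documents in a specialized corpus sit close together in embedding space, the top-ranked candidates may in fact be relevant, which is the difficulty the points above refer to.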
Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes citation links for hard-negative mining
Fine-tunes dense retrievers with citation-informed negatives
Leverages document link structure for domain adaptation
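To show how hard negatives enter fine-tuning, here is a minimal sketch of a softmax contrastive (InfoNCE-style) loss for one query. The source does not state which loss the authors use, so this objective, the `temperature` value, and the function names are assumptions.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def contrastive_loss(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: List[Sequence[float]],
    temperature: float = 0.05,  # assumed value, not taken from the paper
) -> float:
    """Negative log-probability of the positive among positive + negatives."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    m = max(logits)
    log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_denominator - logits[0]
```

A negative that lies close to the query produces a larger loss than one that is far away, so it contributes a stronger training signal; that is the intuition behind preferring hard negatives such as cited documents.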
Authors

Aarush Sinha
University of Copenhagen
Natural Language Processing, Information Retrieval, Machine Learning, Multimodality
Pavan Kumar
Indian Institute of Science, Bengaluru, India
Classical and Quantum Error-Correcting Codes
Roshan Balaji
BioSystems Engineering and Control (BiSECt) Lab, Department of Biotechnology and Wadhwani School of Data Science and AI, Indian Institute of Technology (IIT) Madras, Tamil Nadu, India
N. Bhatt
BioSystems Engineering and Control (BiSECt) Lab, Department of Biotechnology and Wadhwani School of Data Science and AI, Indian Institute of Technology (IIT) Madras, Tamil Nadu, India