BiCA: Effective Biomedical Dense Retrieval with Citation-Aware Hard Negatives

📅 2025-11-11

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To address the difficulty of mining hard negatives and the limited performance of small models in biomedical dense retrieval, this paper proposes a citation-aware method for constructing hard negatives. It uses citation relationships among PubMed articles to automatically identify documents that are highly relevant to a source article, non-redundant, and domain-specific. Citation structure is treated explicitly as the signal for hard negative generation, which requires no manual annotation or auxiliary training. Fine-tuning GTE-small and GTE-base on these negatives yields consistent zero-shot improvements in nDCG@10 on in-domain and out-of-domain BEIR tasks, and the models outperform baselines in Success@5 on long-tail topics in LoTTE. The core contribution is a scalable, lightweight, and domain-adaptive hard negative mining approach that supports data-efficient adaptation of small retrieval models to the biomedical domain.

📝 Abstract

Hard negatives are essential for training effective retrieval models. Hard-negative mining typically relies on ranking documents with cross-encoders or static embedding models, using similarity metrics such as cosine similarity. It becomes challenging in biomedical and scientific domains because source documents and hard negative documents are difficult to distinguish. However, referenced documents naturally share contextual relevance with the source document but are not duplicates, making them well suited as hard negatives. In this work, we propose BiCA: Biomedical Dense Retrieval with Citation-Aware Hard Negatives, an approach to hard-negative mining that uses citation links in 20,000 PubMed articles to improve a small, domain-specific dense retriever. We fine-tune the GTE-small and GTE-base models using these citation-informed negatives and observe consistent improvements in zero-shot dense retrieval, measured by nDCG@10, on both in-domain and out-of-domain BEIR tasks. The models also outperform baselines on long-tailed topics in LoTTE, measured by Success@5. Our findings highlight the potential of leveraging document link structure to generate highly informative negatives, enabling state-of-the-art performance with minimal fine-tuning and demonstrating a path towards highly data-efficient domain adaptation.
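
The two metrics named in the abstract can be illustrated with a short sketch. This is not the paper's evaluation code; it assumes the standard definitions (nDCG with gain `rel / log2(rank + 1)`, and Success@k as "at least one relevant document in the top k"), and the function and argument names are hypothetical.

```python
import math


def ndcg_at_k(ranked_ids, relevance, k=10):
    """nDCG@k for one query.

    ranked_ids: document ids in the order the retriever returned them.
    relevance:  {doc_id: graded relevance}; missing ids count as 0.
    """
    # Discounted gain of the system ranking, ranks starting at 1.
    dcg = sum(
        relevance.get(doc_id, 0) / math.log2(rank + 1)
        for rank, doc_id in enumerate(ranked_ids[:k], start=1)
    )
    # Discounted gain of the best possible ordering of the judged documents.
    ideal = sorted(relevance.values(), reverse=True)[:k]
    idcg = sum(rel / math.log2(rank + 1) for rank, rel in enumerate(ideal, start=1))
    return dcg / idcg if idcg > 0 else 0.0


def success_at_k(ranked_ids, relevant_ids, k=5):
    """1.0 if any relevant document appears in the top k, else 0.0."""
    return 1.0 if any(doc_id in relevant_ids for doc_id in ranked_ids[:k]) else 0.0
```

Benchmark scores are the mean of these per-query values over all test queries.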

Problem

Research questions and friction points this paper is trying to address.

Improving biomedical retrieval by using citation links as hard negatives

Addressing difficulty in distinguishing source documents from hard negatives

Enhancing domain adaptation with minimal fine-tuning through structured negatives
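
As a rough illustration of the idea in the points above, the sketch below turns citation links into training examples: the source article is the positive and the articles it cites become hard negatives. The data layout, the function name, and the assumption that each query is already paired with its source article are all hypothetical; the paper's actual pipeline may differ.

```python
def mine_citation_negatives(queries, citations, corpus, max_negatives=4):
    """Build (query, positive, negatives) examples from citation links.

    queries:   {source_id: query text whose relevant document is source_id}
    citations: {source_id: [ids of the documents that source_id cites]}
    corpus:    {doc_id: document text}
    """
    examples = []
    for source_id, query in queries.items():
        if source_id not in corpus:
            continue
        seen = {source_id}  # never use the positive itself as a negative
        negatives = []
        for cited_id in citations.get(source_id, []):
            # Skip repeated citations and cited documents we have no text for.
            if cited_id in seen or cited_id not in corpus:
                continue
            seen.add(cited_id)
            negatives.append(corpus[cited_id])
            if len(negatives) == max_negatives:
                break
        if negatives:  # sources without usable citations yield no example
            examples.append(
                {"query": query, "positive": corpus[source_id], "negatives": negatives}
            )
    return examples
```

No similarity ranking is involved here: the negatives come directly from the link structure, which is what keeps this kind of mining lightweight.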

Innovation

Methods, ideas, or system contributions that make the work stand out.

Utilizes citation links for hard-negative mining

Fine-tunes dense retrievers with citation-informed negatives

Leverages document link structure for domain adaptation
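
To show why harder negatives give a stronger training signal, here is a minimal sketch of a contrastive (InfoNCE-style) loss for one query. The source does not state which loss BiCA uses, so this is an assumed, commonly used objective; the function names and the temperature value are illustrative, and all vectors are assumed to be non-zero.

```python
import math


def cosine(u, v):
    """Cosine similarity of two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def contrastive_loss(query_vec, positive_vec, negative_vecs, temperature=0.05):
    """Negative log-probability of the positive among positive + negatives."""
    logits = [cosine(query_vec, positive_vec) / temperature]
    logits += [cosine(query_vec, neg) / temperature for neg in negative_vecs]
    # Numerically stable log-sum-exp over all candidates.
    top = max(logits)
    log_sum_exp = top + math.log(sum(math.exp(x - top) for x in logits))
    return log_sum_exp - logits[0]


query = [1.0, 0.0]
positive = [0.9, 0.1]
hard_loss = contrastive_loss(query, positive, [[0.8, 0.2]])   # negative close to the query
easy_loss = contrastive_loss(query, positive, [[-1.0, 0.0]])  # negative far from the query
```

In this toy case `hard_loss` is much larger than `easy_loss`: a negative that sits near the query in embedding space produces a non-trivial loss, while an obviously unrelated one contributes almost nothing.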
