🤖 AI Summary
This work addresses the inefficiency and suboptimal performance of dense retrieval models in domain adaptation, where training data often exhibit redundancy and uneven contribution. To tackle this, we propose OPERA, a framework that combines similarity-based static pruning with dynamic adjustment of sampling probabilities at the query and document levels to efficiently select high-quality training samples during fine-tuning. OPERA makes explicit the trade-off between data quality and query coverage and introduces a two-stage dynamic pruning mechanism that improves model performance while preserving data diversity. Evaluated across eight cross-domain datasets, OPERA achieves an average improvement of 1.9% in NDCG@10 and 0.7% in Recall@20, reduces training time by over 50%, and attains a strong average rank of 1.38 across all methods, significantly outperforming existing approaches.
📝 Abstract
Domain-specific fine-tuning is essential for dense retrievers, yet not all training pairs contribute equally to the learning process. We introduce OPERA, a data pruning framework that exploits this heterogeneity to improve both the effectiveness and efficiency of retrieval model adaptation. We first investigate static pruning (SP), which retains only high-similarity query-document pairs, revealing an intrinsic quality-coverage trade-off: ranking quality (NDCG) improves while retrieval coverage (Recall) can degrade due to reduced query diversity. To resolve this trade-off, we propose a two-stage dynamic pruning (DP) strategy that adaptively modulates sampling probabilities at both the query and document levels throughout training, prioritizing high-quality examples while maintaining access to the full training set. Evaluations across eight datasets spanning six domains demonstrate the effectiveness of both approaches: SP improves ranking over standard fine-tuning (NDCG@10 +0.5%), while DP achieves the strongest performance on both ranking (NDCG@10 +1.9%) and retrieval (Recall@20 +0.7%), with an average rank of 1.38 across all methods. These findings carry over to Qwen3-Embedding, an LLM-based dense retriever, confirming architecture-agnostic benefits. Notably, DP reaches comparable performance in less than 50% of the training time required by standard fine-tuning.
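To make the contrast between the two pruning strategies concrete, the sketch below illustrates the core ideas in miniature: static pruning discards low-similarity pairs outright, while dynamic pruning keeps every pair but skews the sampling distribution toward high-quality examples. The function names, similarity scores, threshold, and temperature here are illustrative assumptions, not the paper's exact formulation.

```python
import math
import random

def static_prune(pairs, threshold=0.6):
    """Static pruning (SP) sketch: keep only query-document pairs whose
    similarity score meets a fixed threshold. High-quality data survives,
    but queries whose pairs all fall below the threshold are lost,
    which is the coverage side of the quality-coverage trade-off."""
    return [p for p in pairs if p["sim"] >= threshold]

def dynamic_sampling_weights(pairs, temperature=0.5):
    """Dynamic pruning (DP) sketch: instead of discarding pairs, assign
    each a sampling probability that grows with its similarity score.
    Every pair keeps a nonzero probability, so the full training set
    remains reachable while high-quality pairs are drawn more often."""
    weights = [math.exp(p["sim"] / temperature) for p in pairs]
    total = sum(weights)
    return [w / total for w in weights]

# Toy training pairs (hypothetical similarity scores).
pairs = [
    {"query": "q1", "doc": "d1", "sim": 0.9},
    {"query": "q2", "doc": "d2", "sim": 0.4},
    {"query": "q3", "doc": "d3", "sim": 0.7},
]

kept = static_prune(pairs)               # SP: the 0.4 pair is dropped entirely
probs = dynamic_sampling_weights(pairs)  # DP: all pairs kept, reweighted
batch = random.choices(pairs, weights=probs, k=2)  # weighted minibatch draw
```

In an actual fine-tuning loop, the DP weights would be recomputed as training progresses (the abstract's two-stage, query- and document-level schedule), so the distribution adapts rather than staying fixed as in this toy version.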