🤖 AI Summary
This paper addresses the negative transfer and performance trade-offs that arise from objective misalignment between information retrieval (IR) and semantic textual similarity (STS). The authors propose CoDiEmb, a unified embedding-learning framework that is collaborative yet task-separated. The method introduces: (1) a dynamic cross-task sampling mechanism that decouples IR's ranking-oriented preferences from STS's geometric-consistency constraints; (2) task-specialized contrastive losses coupled with a delta-guided model-fusion strategy that explicitly mitigates gradient conflicts between the two tasks; and (3) a single-stage, end-to-end training pipeline integrating multi-positive mining, cross-device hard-negative sampling, and ranking-consistency optimization. Evaluated on 15 standard benchmarks, the approach significantly alleviates multi-task interference, yields embeddings with superior geometric properties, and consistently outperforms conventional model ensembling and joint training, with improved convergence stability and strong generalization.
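The single-task batching idea behind the dynamic cross-task sampler can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function name `single_task_batches` and the simple shuffle-then-interleave heuristic are assumptions; the key property it demonstrates is that every batch contains examples from exactly one task, so each update carries a single task's gradient signal.

```python
import random

def single_task_batches(ir_data, sts_data, batch_size, seed=0):
    """Form batches that each contain examples from exactly ONE task,
    then interleave the batches across the training schedule.
    Because batch counts scale with dataset size, per-task update
    frequency stays roughly proportional to the available data."""
    rng = random.Random(seed)
    batches = []
    for task, data in (("ir", ir_data), ("sts", sts_data)):
        items = list(data)
        rng.shuffle(items)  # shuffle within a task
        for i in range(0, len(items), batch_size):
            batches.append((task, items[i:i + batch_size]))
    rng.shuffle(batches)  # interleave IR and STS batches across steps
    return batches
```

At training time, the loop over `batches` would dispatch each `(task, batch)` pair to the corresponding task-specialized loss, so IR and STS gradients never mix within a single step.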
📝 Abstract
Learning unified text embeddings that excel across diverse downstream tasks is a central goal in representation learning, yet negative transfer remains a persistent obstacle. This challenge is particularly pronounced when jointly training a single encoder for Information Retrieval (IR) and Semantic Textual Similarity (STS), two essential but fundamentally disparate tasks for which naive co-training typically yields steep performance trade-offs. We argue that resolving this conflict requires systematically decoupling task-specific learning signals throughout the training pipeline. To this end, we introduce CoDiEmb, a unified framework that reconciles the divergent requirements of IR and STS in a collaborative yet distinct manner. CoDiEmb integrates three key innovations for effective joint optimization: (1) Task-specialized objectives paired with a dynamic sampler that forms single-task batches and balances per-task updates, thereby preventing gradient interference. For IR, we employ a contrastive loss with multiple positives and hard negatives, augmented by cross-device sampling. For STS, we adopt order-aware objectives that directly optimize correlation and ranking consistency. (2) A delta-guided model fusion strategy that computes fine-grained merging weights for checkpoints by analyzing each parameter's deviation from its pre-trained initialization, proving more effective than traditional Model Soups. (3) An efficient, single-stage training pipeline that is simple to implement and converges stably. Extensive experiments on 15 standard IR and STS benchmarks across three base encoders validate CoDiEmb. Our results and analysis demonstrate that the framework not only mitigates cross-task trade-offs but also measurably improves the geometric properties of the embedding space.
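The delta-guided fusion in point (2) can be illustrated with a short sketch. It assumes one plausible weighting rule, per-parameter merge weights proportional to each checkpoint's absolute deviation from the pretrained initialization; the exact rule used by CoDiEmb may differ, and `delta_guided_merge` is a hypothetical helper, shown here only to contrast fine-grained weighting with a uniform Model-Soup average.

```python
import numpy as np

def delta_guided_merge(pretrained, checkpoints, eps=1e-8):
    """Merge task-specific checkpoints into one model.
    For each parameter, weight each checkpoint's delta by its
    magnitude relative to the other checkpoints' deltas, so the
    checkpoint that moved a parameter furthest from the pretrained
    initialization dominates that parameter's merged value."""
    merged = {}
    for name, base in pretrained.items():
        # Stack per-checkpoint deltas: shape (num_checkpoints, *param_shape)
        deltas = np.stack([ckpt[name] - base for ckpt in checkpoints])
        mags = np.abs(deltas)
        # Normalize magnitudes across checkpoints into per-parameter weights
        weights = mags / (mags.sum(axis=0, keepdims=True) + eps)
        merged[name] = base + (weights * deltas).sum(axis=0)
    return merged
```

With a uniform Model-Soup average, a parameter that only one task's checkpoint updated would be diluted by the untouched copies; the delta-proportional weights above instead preserve each task's strongest parameter-level edits.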