LG-ANNA-Embedding technical report

📅 2025-06-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of learning general-purpose text embeddings that perform well across both information retrieval (IR) and non-IR tasks without task-specific fine-tuning. Methodologically, it proposes an instruction-driven, decoder-only embedding framework built upon Mistral-7B, incorporating structured instruction prompting and in-context learning. It introduces soft-label supervision—leveraging continuous relevance scores from high-performance re-rankers—and designs an adaptive margin-based hard negative mining strategy to enhance semantic discriminability and training stability. Evaluated on MTEB v2 (41 tasks), the approach achieves top-tier Borda scores, outperforming several larger-parameter or fully fine-tuned baselines. These results validate the effectiveness of lightweight, generalizable, and fine-tuning-free embedding learning under the instruction paradigm.
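The structured instruction prompting plus in-context learning described above amounts to packing a task instruction, optional few-shot examples, and the input into a single prompt for the decoder-only model. The template below is a minimal sketch; the field names (`Instruct:`, `Query:`) and layout are assumptions, since the report's exact prompt format is not reproduced here.

```python
def build_prompt(instruction, text, examples=()):
    """Assemble a prompt for a decoder-only embedding model from a task
    instruction, optional (input, output) few-shot examples, and the
    text to embed. Hypothetical template, not the report's exact one."""
    parts = [f"Instruct: {instruction}"]
    for ex_in, ex_out in examples:
        parts.append(f"Example input: {ex_in}\nExample output: {ex_out}")
    parts.append(f"Query: {text}")
    return "\n".join(parts)
```

The same function serves IR and non-IR tasks by swapping the instruction (e.g. "Retrieve relevant passages" vs. "Classify the sentiment"), which is the sense in which one model covers both without task-specific fine-tuning.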

📝 Abstract
This report presents a unified instruction-based framework for learning generalized text embeddings optimized for both information retrieval (IR) and non-IR tasks. Built upon a decoder-only large language model (Mistral-7B), our approach combines in-context learning, soft supervision, and adaptive hard-negative mining to generate context-aware embeddings without task-specific fine-tuning. Structured instructions and few-shot examples are used to guide the model across diverse tasks, enabling strong performance on classification, semantic similarity, clustering, and reranking benchmarks. To improve semantic discrimination, we employ a soft labeling framework where continuous relevance scores, distilled from a high-performance dense retriever and reranker, serve as fine-grained supervision signals. In addition, we introduce adaptive margin-based hard-negative mining, which filters out semantically ambiguous negatives based on their similarity to positive examples, thereby enhancing training stability and retrieval robustness. Our model is evaluated on the newly introduced MTEB (English, v2) benchmark, covering 41 tasks across seven categories. Results show that our method achieves strong generalization and ranks among the top-performing models by Borda score, outperforming several larger or fully fine-tuned baselines. These findings highlight the effectiveness of combining in-context prompting, soft supervision, and adaptive sampling for scalable, high-quality embedding generation.
Problem

Research questions and friction points this paper is trying to address.

- Develop unified framework for generalized text embeddings
- Optimize embeddings for both IR and non-IR tasks
- Enhance semantic discrimination via adaptive negative mining
Innovation

Methods, ideas, or system contributions that make the work stand out.

- Unified instruction-based framework for text embeddings
- Soft supervision with continuous relevance scores
- Adaptive margin-based hard-negative mining
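The soft supervision contribution can be sketched as distribution matching: the student's similarity scores over a query's candidate list are pulled toward the teacher reranker's continuous relevance scores via a KL divergence. The temperature and the exact objective below are assumptions for illustration, not the report's published loss.

```python
import numpy as np

def soft_label_loss(student_sims, teacher_scores, tau=1.0):
    """KL(teacher || student) over one query's candidates. Continuous
    teacher scores (e.g. from a reranker) replace binary relevance
    labels, giving fine-grained supervision. `tau` is an illustrative
    temperature, not a value from the report."""
    def softmax(x):
        x = np.asarray(x, dtype=float) / tau
        x = x - x.max()  # numerical stability
        e = np.exp(x)
        return e / e.sum()
    p = softmax(teacher_scores)  # teacher relevance distribution
    q = softmax(student_sims)    # student similarity distribution
    return float(np.sum(p * (np.log(p) - np.log(q))))
```

The loss is zero when the student's ranking distribution matches the teacher's and grows as they diverge, which is how graded relevance (rather than hard positives/negatives) is propagated into the embedding space.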