🤖 AI Summary
Existing text embedding benchmarks primarily evaluate semantic similarity and do not adequately assess higher-order capabilities such as factuality, safety, instruction following, reasoning, and document-level understanding. To address this gap, we introduce a benchmark targeting these five capabilities, built from diverse tasks that simulate real-world scenarios and expose the limitations of current embedding models trained on standard information retrieval data mixtures. Our method reformulates these tasks as retrieval tasks: by framing safety or factuality classification as retrieval problems, we leverage the strength of retrieval models in capturing semantic relationships while pushing them toward a deeper understanding of context and content. Using this approach with single-task fine-tuning, we achieve gains of 8% on factuality classification and 13% on safety classification. Our code and data will be publicly available, providing an empirical foundation for evaluating and improving embedding models' higher-order capabilities.
📝 Abstract
Traditional text embedding benchmarks primarily evaluate embedding models' ability to capture semantic similarity. However, more advanced NLP tasks require a deeper understanding of text, such as safety and factuality. These tasks demand an ability to comprehend and process complex information, often involving the handling of sensitive content or the verification of factual statements against reliable sources. We introduce a new benchmark designed to assess and highlight the limitations of embedding models trained on existing information retrieval data mixtures with respect to advanced capabilities, including factuality, safety, instruction following, reasoning, and document-level understanding. The benchmark comprises a diverse set of tasks that simulate real-world scenarios where these capabilities are critical, and it exposes the gaps in current state-of-the-art embedding models. Furthermore, we propose a novel method that reformulates these various tasks as retrieval tasks. By framing tasks like safety or factuality classification as retrieval problems, we leverage the strengths of retrieval models in capturing semantic relationships while also pushing them to develop a deeper understanding of context and content. Using this approach with single-task fine-tuning, we achieved performance gains of 8% on factuality classification and 13% on safety classification. Our code and data will be publicly available.
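The classification-as-retrieval reformulation described above can be sketched as follows: each class label is expressed as a natural-language "document", the input text is treated as the query, and the predicted class is the label description whose embedding is most similar to the query embedding. This is a minimal illustrative sketch, not the paper's implementation: the toy bag-of-words `embed` function, the label descriptions, and the `classify` helper are all hypothetical stand-ins for a trained embedding model and the paper's actual task prompts.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding" so the sketch is self-contained.
    # A real system would use a trained text embedding model here.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse term-count vectors.
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

# Each class becomes a retrievable "document" described in natural
# language (hypothetical wordings for a safety-classification task).
LABEL_DOCS = {
    "safe": "this text is harmless and contains no unsafe content",
    "unsafe": "this text is harmful unsafe or dangerous content",
}

def classify(text: str) -> str:
    # Retrieval step: return the label whose description is the
    # nearest neighbor of the input text in embedding space.
    q = embed(text)
    return max(LABEL_DOCS, key=lambda lbl: cosine(q, embed(LABEL_DOCS[lbl])))
```

With a strong embedding model in place of the toy `embed`, the same retrieval machinery (query encoding, document encoding, nearest-neighbor search) serves both standard retrieval and these reformulated classification tasks, which is what allows single-task fine-tuning on retrieval-style data to improve classification accuracy.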