SAGE: A Realistic Benchmark for Semantic Understanding

📅 2025-09-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing LLM evaluation benchmarks inadequately capture semantic understanding. To address this, the authors propose SAGE, a rigorous, multidimensional benchmark explicitly designed to assess semantic understanding. SAGE systematically evaluates five core dimensions within a unified framework: human preference alignment, transformation robustness, information sensitivity, clustering performance, and retrieval robustness. It integrates 30+ datasets, human judgments, adversarial examples, and noise perturbations, and benchmarks nine embedding models alongside classical metrics such as Jaccard similarity. Key findings reveal sharp trade-offs among semantic capabilities: text-embedding-3-large achieves the highest human preference alignment (0.682), while text-embedding-3-small pairs the best clustering performance (0.483) with the lowest robustness score (0.011), and Jaccard similarity clearly outperforms all embedding models on information sensitivity (0.905 vs. 0.794). SAGE exposes fundamental limitations and inherent tensions in current semantic representations, offering a more realistic basis for embedding model evaluation and design.
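For intuition on the information-sensitivity result, here is a minimal sketch contrasting token-set Jaccard similarity with cosine similarity over embedding vectors. The example texts and the plain whitespace tokenizer are illustrative assumptions, not the paper's actual setup; embeddings would come from whatever model is being benchmarked.

```python
import numpy as np

def jaccard_similarity(text_a: str, text_b: str) -> float:
    """Token-set Jaccard similarity: |A ∩ B| / |A ∪ B| over whitespace tokens."""
    a, b = set(text_a.lower().split()), set(text_b.lower().split())
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def cosine_similarity(vec_a: np.ndarray, vec_b: np.ndarray) -> float:
    """Cosine similarity between two embedding vectors (vectors come from any embedding model)."""
    return float(vec_a @ vec_b / (np.linalg.norm(vec_a) * np.linalg.norm(vec_b)))

# A targeted one-token edit that changes the information content of the sentence:
original = "the contract takes effect on january 1"
edited = "the contract takes effect on june 1"
print(jaccard_similarity(original, edited))  # 0.75 -- the lexical metric clearly registers the edit
```

The intuition is that a lexical metric penalizes even a single changed token, whereas an embedding-based cosine score can remain high when most of the sentence is unchanged, which is the kind of behavior an information-sensitivity test is designed to surface.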

📝 Abstract
As large language models (LLMs) achieve strong performance on traditional benchmarks, there is an urgent need for more challenging evaluation frameworks that probe deeper aspects of semantic understanding. We introduce SAGE (Semantic Alignment & Generalization Evaluation), a rigorous benchmark designed to assess both embedding models and similarity metrics across five categories: Human Preference Alignment, Transformation Robustness, Information Sensitivity, Clustering Performance, and Retrieval Robustness. Unlike existing benchmarks that focus on isolated capabilities, SAGE evaluates semantic understanding through adversarial conditions, noisy transformations, and nuanced human judgment tasks across 30+ datasets. Our comprehensive evaluation of 9 embedding models and classical metrics reveals significant performance gaps, with no single approach excelling across all dimensions. For instance, while state-of-the-art embedding models like OpenAI's text-embedding-3-large dominate in aligning with human preferences (0.682 vs. 0.591 for the best classical metric), they are significantly outperformed by classical metrics on information sensitivity tasks, where Jaccard Similarity achieves a score of 0.905 compared to the top embedding score of 0.794. SAGE further uncovers critical trade-offs: OpenAI's text-embedding-3-small achieves the highest clustering performance (0.483) but demonstrates extreme brittleness with the lowest robustness score (0.011). SAGE exposes critical limitations in current semantic understanding capabilities and provides a more realistic assessment of model robustness for real-world deployment.
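The abstract reports human-preference-alignment scores, but the exact alignment measure is not stated on this page. Assuming a rank-correlation formulation, a common choice for comparing metric outputs against human judgments, a minimal sketch could look like the following; the function name and all scores are purely hypothetical.

```python
from scipy.stats import spearmanr

def preference_alignment(metric_scores, human_scores) -> float:
    """Rank correlation between a similarity metric's scores and human judgments
    over the same text pairs; higher means the metric orders pairs as people do."""
    rho, _ = spearmanr(metric_scores, human_scores)
    return rho

# Hypothetical scores for five text pairs (not data from the paper):
metric_scores = [0.91, 0.40, 0.75, 0.22, 0.60]
human_scores = [4.5, 2.0, 4.0, 1.0, 3.5]
print(preference_alignment(metric_scores, human_scores))  # 1.0 -- identical ranking
```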
Problem

Research questions and friction points this paper is trying to address.

Creating a more challenging benchmark that probes deeper aspects of semantic understanding
Assessing embedding models and similarity metrics across five categories
Evaluating semantic understanding under adversarial and noisy conditions
Innovation

Methods, ideas, or system contributions that make the work stand out.

Benchmark evaluates embedding models and similarity metrics
Tests semantic understanding across five challenging categories
Uses adversarial conditions and noisy transformations to stress-test similarity judgments (see the sketch below)
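As a rough illustration of the robustness-under-noise idea, the toy sketch below pairs a character-swap perturbation with an averaged similarity score. The noise model, function names, and the Jaccard stand-in are assumptions for illustration, not SAGE's actual protocol.

```python
import random

def swap_adjacent_chars(text: str, rate: float = 0.05, seed: int = 0) -> str:
    """Toy noise model: swap adjacent characters with the given probability."""
    rng = random.Random(seed)
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if rng.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # do not re-swap the pair we just touched
        else:
            i += 1
    return "".join(chars)

def token_jaccard(a: str, b: str) -> float:
    """Simple lexical similarity, standing in for any metric or embedding model."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def transformation_robustness(texts, similarity_fn, perturb_fn) -> float:
    """Mean similarity between each text and its perturbed copy; a robust
    approach stays close to 1.0 when the perturbation preserves meaning."""
    scores = [similarity_fn(t, perturb_fn(t)) for t in texts]
    return sum(scores) / len(scores)

texts = ["semantic benchmarks should survive small amounts of noise"]
# Prints a value in [0, 1]; values near 1.0 mean the metric tolerated the noise.
print(transformation_robustness(texts, token_jaccard, swap_adjacent_chars))
```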