🤖 AI Summary
This work addresses the difficulty language models have with idioms, whose meanings cannot be inferred from their literal constituents. The authors introduce IdioLink, the first cross-surface semantic retrieval benchmark specifically designed for idioms. It comprises 10,700 documents and 2,140 queries spanning 107 idioms in both literal and figurative uses, and it tests whether a model can link an idiomatic expression to a literal or paraphrased expression of the same meaning. The spans that carry the core meaning are annotated in every document and query to enable fine-grained evaluation. Experiments with strong embedding models (BGE, E5, Contriever, and Qwen) show that these models struggle to capture the semantic equivalence between idiomatic and literal expressions, relying on topical and shallow semantic cues rather than genuine semantic understanding. The study thus provides a benchmark and a challenging testbed for cross-surface semantic comprehension in natural language processing.
📝 Abstract
Idioms pose a fundamental challenge for language models, as their meaning cannot be inferred from surface form alone. Understanding such expressions, therefore, requires semantic abstraction beyond lexical overlap. We introduce IdioLink, a retrieval benchmark designed to test whether models can link idiomatic expressions to conceptually equivalent meanings expressed in literal or paraphrased forms. IdioLink comprises 10,700 documents and 2,140 queries, spanning 107 idioms with both literal and figurative uses. Each document and query is annotated with spans that convey the core meaning. Evaluating strong embedding baselines (e.g., BGE, E5, Contriever, and Qwen), we show that current models struggle to retrieve equivalent meanings across divergent surface realizations, relying instead on topical and shallow semantic cues. IdioLink exposes key gaps in idiom-aware semantic retrieval and provides a challenging testbed for future models.
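To make the retrieval setup concrete, the following is a minimal sketch of how an embedding-based retriever could be scored on a benchmark of this kind. It is not the authors' evaluation code. The abstract does not state the similarity function or the metric, so cosine similarity and hit rate at k are assumptions, and the three-dimensional vectors are hypothetical stand-ins for the output of a real embedding model such as BGE or E5.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (assumed scoring function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def rank_documents(query_vec, doc_vecs):
    """Return document indices sorted by descending similarity to the query."""
    scored = [(cosine(query_vec, d), i) for i, d in enumerate(doc_vecs)]
    scored.sort(key=lambda s: (-s[0], s[1]))  # ties broken by lower index
    return [i for _, i in scored]

def hit_rate_at_k(query_vecs, doc_vecs, relevant, k):
    """Fraction of queries with at least one relevant document in the top k."""
    hits = 0
    for q, rel in zip(query_vecs, relevant):
        top_k = rank_documents(q, doc_vecs)[:k]
        if any(i in rel for i in top_k):
            hits += 1
    return hits / len(query_vecs)

# Hypothetical toy embeddings: document 0 and document 1 are the relevant
# targets, document 2 is a topically similar distractor.
doc_vecs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
query_vecs = [[0.9, 0.1, 0.0], [0.1, 0.2, 0.9]]
relevant = [{0}, {1}]  # indices of documents with equivalent meaning, per query

print(hit_rate_at_k(query_vecs, doc_vecs, relevant, k=1))
print(hit_rate_at_k(query_vecs, doc_vecs, relevant, k=2))
```

In this toy setup the second query lands closer to the distractor than to its relevant document, which mirrors the failure mode the abstract describes: a model that keys on topical cues ranks a surface-similar document above the one that expresses the same meaning.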