🤖 AI Summary
Language models have not been systematically evaluated on their ability to understand multiword expressions, such as idioms, noun compounds, and verbal constructions, whose meaning requires deep semantic processing. This work proposes SemanticQA, a benchmark suite that unifies disparate multiword expression resources into a structured testbed spanning extraction, classification, and interpretation tasks, plus sequential compositions of those tasks. SemanticQA is designed to evaluate models of diverse architectures and scales. Experiments reveal substantial performance variation across models, particularly on tasks requiring semantic reasoning, providing both empirical evidence and a benchmark for improving language models' comprehension of non-trivial semantic phrases.
📝 Abstract
We present SemanticQA, an evaluation suite designed to assess language models (LMs) on semantic phrase processing tasks. The benchmark consolidates existing multiword expression (MwE) resources and reorganizes them into a unified testbed. It covers both general lexical phenomena, such as lexical collocations, and three fine-grained categories: idiomatic expressions, noun compounds, and verbal constructions. Through SemanticQA, we assess LMs of diverse architectures and scales on extraction, classification, and interpretation tasks, as well as sequential task compositions. We find substantial performance variation, particularly on tasks requiring semantic reasoning. This variation highlights differences in the reasoning efficacy and semantic understanding of LMs and offers insights for developing LMs with stronger comprehension of non-trivial semantic phrases. The evaluation harness and data of SemanticQA are available at https://github.com/jacklanda/SemanticQA.