🤖 AI Summary
Current NLP systems lack a quantifiable, reference-free metric for evaluating implicitness—a critical bottleneck for deeper analysis of language understanding capabilities. To address this, we propose ImpScore: the first learnable scalar metric that formally defines implicitness as the divergence between semantic meaning and pragmatic interpretation. Methodologically, we construct a human-annotated dataset of (implicit sentence, explicit sentence) pairs and train a linguistics-informed, interpretable regression model with pairwise contrastive learning, enabling reference-free evaluation of unseen text. Experiments show that ImpScore agrees closely with human judgments on out-of-distribution data (Spearman ρ > 0.87) and reveals a significant degradation in mainstream large language models' comprehension of highly implicit hate speech. The code and models are publicly released.
📝 Abstract
Handling implicit language is essential for natural language processing systems to achieve precise text understanding and facilitate natural interactions with users. Despite its importance, the absence of a metric for accurately measuring the implicitness of language significantly constrains the depth of analysis possible in evaluating models' comprehension capabilities. This paper addresses this gap by developing a scalar metric that quantifies the implicitness level of language without relying on external references. Drawing on principles from traditional linguistics, we define "implicitness" as the divergence between semantic meaning and pragmatic interpretation. To operationalize this definition, we introduce ImpScore, a reference-free metric formulated through an interpretable regression model. This model is trained using pairwise contrastive learning on a specially curated dataset consisting of (implicit sentence, explicit sentence) pairs. We validate ImpScore through a user study that compares its assessments with human evaluations on out-of-distribution data, demonstrating its accuracy and strong correlation with human judgments. Additionally, we apply ImpScore to hate speech detection datasets, illustrating its utility and highlighting significant limitations in current large language models' ability to understand highly implicit content. Our metric is publicly available at https://github.com/audreycs/ImpScore.
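To make the training setup concrete, here is a minimal sketch of pairwise contrastive learning on (implicit sentence, explicit sentence) pairs, of the kind the abstract describes. The scoring function and margin loss below are illustrative assumptions, not ImpScore's actual architecture; the real model maps sentences to semantic and pragmatic representations and scores their divergence with a learned regression model.

```python
# Hypothetical sketch: a pairwise margin loss that trains a scalar scorer
# so each implicit sentence out-scores its explicit paraphrase. The score
# values and margin are placeholders, not the paper's implementation.

def pairwise_margin_loss(score_implicit, score_explicit, margin=1.0):
    """Hinge loss: zero when the implicit sentence beats the explicit
    one by at least `margin`, linear penalty otherwise."""
    return max(0.0, margin - (score_implicit - score_explicit))

# Average loss over a toy mini-batch of (implicit, explicit) score pairs.
pairs = [(2.3, 0.5), (1.1, 0.9), (0.2, 1.4)]
batch_loss = sum(pairwise_margin_loss(si, se) for si, se in pairs) / len(pairs)
```

Because the metric is learned from relative comparisons rather than absolute labels, it can score a single sentence at inference time without any reference text.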