🤖 AI Summary
This study investigates whether large language models genuinely comprehend the compositional semantics of events or merely rely on superficial heuristics, focusing on their ability to distinguish activity from accomplishment predicates in the context of the imperfective paradox. To this end, we construct ImperfectiveNLI, a diagnostic natural language inference dataset, and combine it with representational analysis and prompt-based interventions to systematically evaluate prominent open-weight models. Our work reveals, for the first time, a “teleological bias” in these models: they persistently infer goal completion even when completion is explicitly negated. It also uncovers a pervasive “completive illusion”: while the models internally differentiate between ongoing processes and completed outcomes, their reasoning is dominated by strong prior assumptions about event culmination. Although prompt interventions mitigate this bias, they simultaneously impair the models’ capacity to recognize valid entailments, indicating a fundamental lack of structural aspectual awareness.
📝 Abstract
Do Large Language Models (LLMs) genuinely grasp the compositional semantics of events, or do they rely on surface-level probabilistic heuristics? We investigate the Imperfective Paradox, a logical phenomenon in which the past progressive aspect entails event realization for activities (e.g., running $\to$ ran) but not for accomplishments (e.g., building $\nrightarrow$ built). We introduce ImperfectiveNLI, a diagnostic dataset designed to probe this distinction across diverse semantic classes. Evaluating state-of-the-art open-weight models, we uncover a pervasive Teleological Bias: models systematically hallucinate completion for goal-oriented events, often overriding explicit textual negation. Representational analyses show that while internal embeddings often distinguish process from result, inference decisions are dominated by strong priors about goal attainment. We further find that prompt-based interventions reduce hallucinated completions but also increase incorrect rejections of valid entailments. Our findings suggest that current LLMs lack structural aspectual awareness, operating as predictive narrative engines rather than faithful logical reasoners.
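To make the entailment pattern concrete, the sketch below shows how an ImperfectiveNLI-style item might be represented as a premise/hypothesis pair with a gold label. The field names, example sentences, and label scheme are illustrative assumptions for exposition only, not the released dataset format.

```python
# Minimal sketch of aspectual NLI items in the spirit of ImperfectiveNLI.
# Field names and labels are hypothetical; they do not reflect the paper's schema.
from dataclasses import dataclass

@dataclass
class AspectualNLIItem:
    premise: str      # past progressive clause describing an ongoing event
    hypothesis: str   # simple past clause asserting the event was realized
    gold_label: str   # "entailment" for activities, "neutral" for accomplishments

items = [
    # Activity: the progressive entails realization (running -> ran).
    AspectualNLIItem(
        premise="Mary was running in the park.",
        hypothesis="Mary ran in the park.",
        gold_label="entailment",
    ),
    # Accomplishment: the progressive does not entail culmination (building -/-> built),
    # even when the premise explicitly negates completion.
    AspectualNLIItem(
        premise="John was building a house, but he never finished it.",
        hypothesis="John built a house.",
        gold_label="neutral",
    ),
]

for item in items:
    print(f"P: {item.premise}\nH: {item.hypothesis}\nGold: {item.gold_label}\n")
```

A model exhibiting the teleological bias described above would tend to answer "entailment" for both items, hallucinating completion of the goal-oriented (accomplishment) event despite the explicit negation in the premise.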