🤖 AI Summary
This study investigates the discourse-level coreference resolution capabilities of large language models (LLMs) to uncover fundamental disparities between model and human comprehension. To this end, we first operationalize dynamic semantics into a computationally tractable coreference accessibility benchmark, constructing a manually curated, controllable discourse-level evaluation dataset. We then systematically compare human participants’ coreference judgments against those of leading LLMs (e.g., GPT, Claude, Llama). Our contributions are threefold: (1) the first dynamic-semantic framework for evaluating discourse-level coreference accessibility; (2) empirical evidence that while LLMs approach human performance on lexically cued coreference tasks, they significantly underperform on structurally abstract coreference chain reasoning; and (3) quantitative evidence that LLMs rely predominantly on shallow surface representations and lack deep modeling of discourse dynamics. Together, these contributions advance natural language understanding evaluation from token- and sentence-level analysis to genuine discourse semantics.
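To make the benchmark idea concrete, below is a minimal sketch of how accessibility minimal pairs could be represented and scored. The item schema, field names, and the toy judge are illustrative assumptions rather than the paper's actual format; only the underlying contrast (an indefinite introduces a discourse referent that a later pronoun can pick up, while the same indefinite under negation does not) is the standard dynamic-semantics prediction.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AccessibilityItem:
    """One minimal pair probing whether a discourse referent is accessible
    to a later pronoun (schema and field names are illustrative)."""
    context: str      # discourse that introduces (or blocks) a referent
    probe: str        # follow-up sentence containing the pronoun
    pronoun: str      # the anaphor being judged
    accessible: bool  # gold label predicted by dynamic semantics

# Textbook contrast: an indefinite licenses later anaphora,
# but the same indefinite under negation does not.
ITEMS: List[AccessibilityItem] = [
    AccessibilityItem("A student walked in.", "She sat down.", "She", True),
    AccessibilityItem("No student walked in.", "She sat down.", "She", False),
]

def agreement(judge: Callable[[str, str], bool],
              items: List[AccessibilityItem]) -> float:
    """Fraction of items on which a judge (human annotator or wrapped LLM)
    matches the dynamic-semantics prediction."""
    hits = sum(judge(it.context, it.probe) == it.accessible for it in items)
    return hits / len(items)

if __name__ == "__main__":
    # Placeholder judge: deems the pronoun resolvable whenever the context
    # starts with an indefinite article -- a crude lexical heuristic standing
    # in for a real human or model judgment.
    naive_lexical_judge = lambda context, probe: context.lower().startswith("a ")
    print(f"naive judge agreement: {agreement(naive_lexical_judge, ITEMS):.2f}")
```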
📝 Abstract
We present a hierarchy of natural language understanding abilities and argue for the importance of moving beyond assessments of understanding at the lexical and sentence levels to the discourse level. We propose the task of anaphora accessibility as a diagnostic for assessing discourse understanding, and to this end, present an evaluation dataset inspired by theoretical research in dynamic semantics. We evaluate human and LLM performance on our dataset and find that LLMs and humans align on some tasks and diverge on others. Such divergence can be explained by LLMs' reliance on specific lexical items during language comprehension, in contrast to human sensitivity to structural abstractions.
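One way such accessibility judgments could be elicited from an LLM and compared against human annotators is sketched below. The prompt wording, the yes/no protocol, and the `ask` callable are assumptions made for illustration, not the elicitation procedure reported in the paper.

```python
from typing import Callable

def probe_prompt(context: str, probe: str, pronoun: str) -> str:
    """Phrase one accessibility item as a yes/no question (assumed wording)."""
    return (
        "Read the short discourse below.\n\n"
        f"{context} {probe}\n\n"
        f'Does "{pronoun}" in the second sentence clearly refer to someone '
        "introduced earlier in the discourse? Answer Yes or No."
    )

def judge_with_llm(ask: Callable[[str], str],
                   context: str, probe: str, pronoun: str) -> bool:
    """Turn a model's free-form reply into a boolean accessibility judgment.
    `ask` is any function that sends a prompt to a model and returns its
    reply; a real API client would be wrapped here."""
    reply = ask(probe_prompt(context, probe, pronoun))
    return reply.strip().lower().startswith("yes")

if __name__ == "__main__":
    # Stand-in model that always answers "Yes", just to exercise the pipeline.
    always_yes = lambda prompt: "Yes"
    print(judge_with_llm(always_yes, "No student walked in.", "She sat down.", "She"))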