Meaning Beyond Truth Conditions: Evaluating Discourse Level Understanding via Anaphora Accessibility

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study probes the discourse-level anaphora resolution capabilities of large language models (LLMs) to uncover fundamental disparities between model and human comprehension. We first operationalize dynamic semantics as a computationally tractable anaphora accessibility benchmark, constructing a manually curated, controllable discourse-level evaluation dataset. We then systematically compare human participants' coreference judgments against those of leading LLMs (e.g., GPT, Claude, Llama). The contributions are threefold: (1) the first dynamic-semantics framework for evaluating discourse-level anaphora accessibility; (2) empirical evidence that while LLMs approach human performance on lexically cued coreference tasks, they substantially underperform when accessibility depends on structural abstraction rather than on specific lexical items; and (3) quantitative evidence that LLMs rely predominantly on shallow surface representations and lack a deep model of discourse dynamics, moving natural language understanding evaluation beyond token- and sentence-level analysis toward genuine discourse semantics.
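For concreteness, the kind of contrast such a benchmark targets can be sketched as data. The following is a minimal, hypothetical illustration (the paper's actual item schema and prompts are not given on this page): an indefinite such as "a donkey" introduces a discourse referent that a later pronoun can pick up, while "no donkey" blocks it, even though the two discourses share almost all of their lexical material.

```python
# Minimal sketch of anaphora-accessibility test items, inspired by
# dynamic semantics. All field names and prompts are hypothetical,
# not the paper's actual schema.
from dataclasses import dataclass

@dataclass
class AccessibilityItem:
    discourse: str    # short discourse containing a pronoun
    pronoun: str      # the anaphor whose antecedent is queried
    accessible: bool  # gold label: is an antecedent accessible?

# Classic dynamic-semantics contrast: the indefinite licenses the
# pronoun; negation closes off the discourse referent.
ITEMS = [
    AccessibilityItem(
        discourse="A farmer owns a donkey. He feeds it every morning.",
        pronoun="it",
        accessible=True,   # "a donkey" is an accessible antecedent
    ),
    AccessibilityItem(
        discourse="No farmer owns a donkey. He feeds it every morning.",
        pronoun="it",
        accessible=False,  # negation blocks the antecedent
    ),
]

def prompt_for(item: AccessibilityItem) -> str:
    """Frame an item as a yes/no judgment, one possible way to elicit
    the same judgment from human participants and LLMs."""
    return (
        f"Discourse: {item.discourse}\n"
        f"Can the pronoun '{item.pronoun}' refer to something "
        f"introduced earlier in this discourse? Answer yes or no."
    )

print(prompt_for(ITEMS[1]))
```

Because the two items differ only in "A" versus "No", a judge that tracks surface lexical cues alone cannot tell them apart; the gold labels diverge for structural reasons.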

📝 Abstract
We present a hierarchy of natural language understanding abilities and argue for the importance of moving beyond assessments of understanding at the lexical and sentence levels to the discourse level. We propose the task of anaphora accessibility as a diagnostic for assessing discourse understanding, and to this end, present an evaluation dataset inspired by theoretical research in dynamic semantics. We evaluate human and LLM performance on our dataset and find that LLMs and humans align on some tasks and diverge on others. Such divergence can be explained by LLMs' reliance on specific lexical items during language comprehension, in contrast to human sensitivity to structural abstractions.
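As a rough illustration of the human-versus-model comparison the abstract describes, the sketch below scores yes/no accessibility judgments against gold labels, broken down by condition (for example, lexically cued versus structurally mediated items). Everything here is hypothetical scaffolding; in particular, `query_model` stands in for a real LLM API call, and the paper's actual protocol and condition names are not specified on this page.

```python
# Hypothetical evaluation harness: per-condition accuracy of a judge
# (human or LLM) on anaphora-accessibility items.
from collections import defaultdict

def query_model(prompt: str) -> bool:
    # Placeholder: a real implementation would call an LLM and parse
    # its yes/no answer. Always answering "yes" mimics a judge that
    # ignores discourse structure entirely.
    return True

def accuracy_by_condition(items, judge):
    """Each item is a dict with 'prompt', 'gold' (bool), and
    'condition' keys; returns accuracy per condition."""
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        total[item["condition"]] += 1
        if judge(item["prompt"]) == item["gold"]:
            correct[item["condition"]] += 1
    return {c: correct[c] / total[c] for c in total}

items = [
    {"prompt": "A farmer owns a donkey. Can 'it' refer back? (yes/no)",
     "gold": True, "condition": "lexically_cued"},
    {"prompt": "No farmer owns a donkey. Can 'it' refer back? (yes/no)",
     "gold": False, "condition": "structural"},
]
print(accuracy_by_condition(items, query_model))
# -> {'lexically_cued': 1.0, 'structural': 0.0}: the divergence
#    pattern the abstract reports, reproduced by construction.
```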
Problem

Research questions and friction points this paper is trying to address.

Evaluate discourse understanding via anaphora accessibility
Compare human and LLM performance on discourse tasks
Analyze LLMs' reliance on lexical items versus human structural sensitivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchy of language understanding abilities
Anaphora accessibility as diagnostic tool
Dataset inspired by dynamic semantics
👥 Authors
Xiaomeng Zhu
Department of Linguistics, Yale University
Zhenghao Zhou
Department of Linguistics, Yale University
Simon Charlow
Department of Linguistics, Yale University
Robert Frank
Professor of Linguistics, Yale University
Syntax · Mathematical Linguistics · Computational Linguistics · Tree Adjoining Grammar · Language Acquisition and Processing