🤖 AI Summary
This study investigates the discourse-level coreference resolution capabilities of large language models (LLMs) to uncover fundamental disparities between model and human comprehension. To this end, we first operationalize dynamic semantics into a computationally tractable coreference accessibility benchmark, constructing a manually curated, controllable discourse-level evaluation dataset. We then systematically compare human participants’ coreference judgments against those of leading LLMs (e.g., GPT, Claude, Llama). Our contributions are threefold: (1) the first dynamic-semantic framework for evaluating discourse-level coreference accessibility; (2) empirical evidence that while LLMs approach human performance on lexically cued coreference tasks, they significantly underperform on structurally abstract coreference chain reasoning; and (3) quantitative evidence that LLMs rely predominantly on shallow surface representations and lack deep modeling of discourse dynamics. Together, these contributions advance natural language understanding evaluation from token- and sentence-level analysis to genuine discourse semantics.
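To make the benchmark idea concrete, below is a minimal sketch of how accessibility minimal pairs could be represented and scored. The item schema, field names, and the toy judge are illustrative assumptions rather than the paper's actual format; only the underlying contrast (an indefinite introduces a discourse referent that a later pronoun can pick up, while the same indefinite under negation does not) is the standard dynamic-semantics prediction.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class AccessibilityItem:
    """One minimal pair probing whether a discourse referent is accessible
    to a later pronoun (schema and field names are illustrative)."""
    context: str      # discourse that introduces (or blocks) a referent
    probe: str        # follow-up sentence containing the pronoun
    pronoun: str      # the anaphor being judged
    accessible: bool  # gold label predicted by dynamic semantics

# Textbook contrast: an indefinite licenses later anaphora,
# but the same indefinite under negation does not.
ITEMS: List[AccessibilityItem] = [
    AccessibilityItem("A student walked in.", "She sat down.", "She", True),
    AccessibilityItem("No student walked in.", "She sat down.", "She", False),
]

def agreement(judge: Callable[[str, str], bool],
              items: List[AccessibilityItem]) -> float:
    """Fraction of items on which a judge (human annotator or wrapped LLM)
    matches the dynamic-semantics prediction."""
    hits = sum(judge(it.context, it.probe) == it.accessible for it in items)
    return hits / len(items)

if __name__ == "__main__":
    # Placeholder judge: deems the pronoun resolvable whenever the context
    # starts with an indefinite article -- a crude lexical heuristic standing
    # in for a real human or model judgment.
    naive_lexical_judge = lambda context, probe: context.lower().startswith("a ")
    print(f"naive judge agreement: {agreement(naive_lexical_judge, ITEMS):.2f}")
```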
📝 Abstract
We present a hierarchy of natural language understanding abilities and argue for the importance of moving beyond assessments of understanding at the lexical and sentence levels to the discourse level. We propose the task of anaphora accessibility as a diagnostic for assessing discourse understanding, and to this end, present an evaluation dataset inspired by theoretical research in dynamic semantics. We evaluate human and LLM performance on our dataset and find that LLMs and humans align on some tasks and diverge on others. Such divergence can be explained by LLMs' reliance on specific lexical items during language comprehension, in contrast to human sensitivity to structural abstractions.
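One way such accessibility judgments could be elicited from an LLM and compared against human annotators is sketched below. The prompt wording, the yes/no protocol, and the `ask` callable are assumptions made for illustration, not the elicitation procedure reported in the paper.

```python
from typing import Callable

def probe_prompt(context: str, probe: str, pronoun: str) -> str:
    """Phrase one accessibility item as a yes/no question (assumed wording)."""
    return (
        "Read the short discourse below.\n\n"
        f"{context} {probe}\n\n"
        f'Does "{pronoun}" in the second sentence clearly refer to someone '
        "introduced earlier in the discourse? Answer Yes or No."
    )

def judge_with_llm(ask: Callable[[str], str],
                   context: str, probe: str, pronoun: str) -> bool:
    """Turn a model's free-form reply into a boolean accessibility judgment.
    `ask` is any function that sends a prompt to a model and returns its
    reply; a real API client would be wrapped here."""
    reply = ask(probe_prompt(context, probe, pronoun))
    return reply.strip().lower().startswith("yes")

if __name__ == "__main__":
    # Stand-in model that always answers "Yes", just to exercise the pipeline.
    always_yes = lambda prompt: "Yes"
    print(judge_with_llm(always_yes, "No student walked in.", "She sat down.", "She"))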