The Dog the Cat Chased Stumped the Model: Measuring When Language Models Abandon Structure for Shortcuts

📅 2025-10-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether language models genuinely perform syntactic parsing or rely predominantly on semantic shortcuts when comprehending complex sentences. Method: The authors introduce CenterBench, the first benchmark targeting center-embedded structures, built by generating syntactically valid yet semantically absurd recursively nested sentences, thereby isolating structural comprehension from semantic matching. The evaluation framework comprises six task categories spanning surface-level understanding, syntactic dependency parsing, and causal reasoning, augmented with chain-of-thought analysis. Contribution/Results: Experiments reveal median performance gaps of up to 26.8 percentage points between semantically plausible and implausible samples as embedding depth increases, demonstrating systematic reliance on semantic shortcuts across mainstream models. Even state-of-the-art reasoning models exhibit persistent shortcut dependence and evasion behaviors (e.g., refusing to answer). This work provides the first quantitative characterization of the point at which neural models abandon structural reasoning.

📝 Abstract
When language models correctly parse "The cat that the dog chased meowed," are they analyzing syntax or simply familiar with dogs chasing cats? Despite extensive benchmarking, we lack methods to distinguish structural understanding from semantic pattern matching. We introduce CenterBench, a dataset of 9,720 comprehension questions on center-embedded sentences (like "The cat [that the dog chased] meowed") where relative clauses nest recursively, creating processing demands from simple to deeply nested structures. Each sentence has a syntactically identical but semantically implausible counterpart (e.g., mailmen prescribe medicine, doctors deliver mail) and six comprehension questions testing surface understanding, syntactic dependencies, and causal reasoning. Testing six models reveals that performance gaps between plausible and implausible sentences widen systematically with complexity, with models showing median gaps up to 26.8 percentage points, quantifying when they abandon structural analysis for semantic associations. Notably, semantic plausibility harms performance on questions about resulting actions, where following causal relationships matters more than semantic coherence. Reasoning models improve accuracy but their traces show semantic shortcuts, overthinking, and answer refusal. Unlike models, whose plausibility advantage systematically widens with complexity, humans show variable semantic effects. CenterBench provides the first framework to identify when models shift from structural analysis to pattern matching.
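The recursive construction the abstract describes can be sketched in a few lines. This is an illustrative assumption about how such stimuli could be generated, not the authors' actual CenterBench code; the `center_embed` helper and its word lists are hypothetical.

```python
def center_embed(nouns, verbs):
    """Build a center-embedded sentence (hypothetical sketch).

    nouns: [n1, n2, ..., nk] from outermost to innermost subject.
    verbs: [matrix_verb, v1, ..., v_{k-1}], one embedded verb per
           relative clause, paired outermost-first with nouns[1:].
    Embedding depth is len(nouns) - 1.
    """
    # Noun phrases nest left-to-right: "the n1 that the n2 that ..."
    noun_part = " that ".join(f"the {n}" for n in nouns)
    # Verbs resolve innermost-first, with the matrix verb last.
    verb_part = " ".join(verbs[:0:-1] + verbs[:1])
    return (noun_part + " " + verb_part).capitalize() + "."

# Depth 1, plausible:
center_embed(["cat", "dog"], ["meowed", "chased"])
# -> "The cat that the dog chased meowed."

# Same template with roles swapped gives the syntactically identical
# but implausible counterpart ("The dog that the cat chased meowed.").
center_embed(["dog", "cat"], ["meowed", "chased"])
```

Deeper nesting just extends both lists, e.g. `center_embed(["cat", "dog", "boy"], ["meowed", "chased", "walked"])` yields a depth-2 sentence, which is where human and model parsing both begin to strain.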
Problem

Research questions and friction points this paper is trying to address.

Distinguish structural understanding from semantic pattern matching in language models
Measure when models abandon syntax for semantic shortcuts as complexity increases
Test model performance on plausible versus implausible center-embedded sentences
Innovation

Methods, ideas, or system contributions that make the work stand out.

CenterBench dataset tests structural understanding
Measures plausibility gap across complexity levels
Quantifies shift from syntax to semantic shortcuts
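The plausibility-gap measurement above can be sketched as follows. The field names (`depth`, `plausible`, `correct`) and the per-depth aggregation are assumptions for illustration, not the paper's exact analysis pipeline.

```python
def accuracy(outcomes):
    """Percent correct over a list of booleans; 0.0 if empty."""
    return 100.0 * sum(outcomes) / len(outcomes) if outcomes else 0.0

def plausibility_gap(results):
    """Per-depth accuracy gap in percentage points (hypothetical sketch).

    results: iterable of dicts with keys 'depth' (int),
             'plausible' (bool), 'correct' (bool).
    Returns {depth: accuracy(plausible) - accuracy(implausible)}.
    A gap that widens with depth signals reliance on semantic
    shortcuts rather than structural analysis.
    """
    buckets = {}
    for r in results:
        buckets.setdefault((r["depth"], r["plausible"]), []).append(r["correct"])
    depths = sorted({d for d, _ in buckets})
    return {
        d: accuracy(buckets.get((d, True), [])) - accuracy(buckets.get((d, False), []))
        for d in depths
    }
```

Fed per-item model outcomes, this yields one gap per embedding depth; the paper's headline figure corresponds to a median of such gaps reaching 26.8 percentage points.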