Diagnosing CFG Interpretation in LLMs

2026-04-22

AI Summary
This work addresses the difficulty large language models have in keeping structural semantics consistent when interpreting dynamically defined context-free grammars. The authors propose RoboGrid, a framework that systematically evaluates hierarchical state-tracking by decoupling the syntactic, behavioral, and semantic dimensions through controlled stress tests of recursion depth, expression complexity, and surface style. Combining context-free grammar stress testing, chain-of-thought reasoning, artificially constructed "alien" lexicons, and structural density analysis, the study finds that models preserve surface-level grammaticality but that their semantic alignment degrades rapidly under deep recursion or highly branching structures. Models also rely heavily on semantic cues from keywords rather than purely symbolic induction, which exposes fundamental limits in their capacity for formal language comprehension.
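
To make the "alien" lexicon idea concrete, the sketch below renames the keywords of a toy command language to meaningless pseudo-words while leaving numbers and brackets untouched, so that only the structure carries information. The keyword set (`move`, `left`, `right`, `repeat`) and the helper names `make_alien_lexicon` and `to_alien` are hypothetical illustrations, not taken from RoboGrid; the paper's actual lexicon construction may differ.

```python
import random


def make_alien_lexicon(keywords, seed=0):
    """Map each keyword to a unique nonsense word (hypothetical scheme)."""
    rng = random.Random(seed)  # seeded so the mapping is reproducible
    consonants, vowels = "bdfgklmnprstvz", "aeiou"
    mapping = {}
    used = set(keywords)  # never reuse a real keyword as an alien word
    for kw in keywords:
        while True:
            # Build a three-syllable consonant-vowel pseudo-word.
            word = "".join(
                rng.choice(consonants) + rng.choice(vowels) for _ in range(3)
            )
            if word not in used:
                break
        used.add(word)
        mapping[kw] = word
    return mapping


def to_alien(tokens, mapping):
    """Replace keywords by their alien counterparts; keep all other tokens."""
    return [mapping.get(tok, tok) for tok in tokens]


keywords = ["move", "left", "right", "repeat"]
lexicon = make_alien_lexicon(keywords, seed=1)
program = "repeat 2 { move left }".split()
alien_program = to_alien(program, lexicon)
```

Because the mapping is one-to-one, the alien program has the same parse tree as the original; any drop in model accuracy on it would then point to reliance on keyword meaning rather than on the grammar itself.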


Abstract
As LLMs are increasingly integrated into agentic systems, they must adhere to dynamically defined, machine-interpretable interfaces. We evaluate LLMs as in-context interpreters: given a novel context-free grammar, can LLMs generate syntactically valid, behaviorally functional, and semantically faithful outputs? We introduce RoboGrid, a framework that disentangles syntax, behavior, and semantics through controlled stress-tests of recursion depth, expression complexity, and surface styles. Our experiments reveal a consistent hierarchical degradation: LLMs often maintain surface syntax but fail to preserve structural semantics. Despite the partial mitigation provided by CoT reasoning, performance collapses under structural density, specifically deep recursion and high branching, with semantic alignment vanishing at extreme depths. Furthermore, "Alien" lexicons reveal that LLMs rely on semantic bootstrapping from keywords rather than pure symbolic induction. These findings pinpoint critical gaps in hierarchical state-tracking required for reliable, grammar-agnostic agents.
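
As an illustration of what a syntactic-validity check with a recursion-depth measurement might look like, here is a minimal sketch for a toy grid-robot language. The grammar (`program := stmt*`, `stmt := move | left | right | repeat INT { program }`) and the function name `parse_depth` are assumptions made for this example; they are not the grammar or API used by RoboGrid.

```python
def parse_depth(tokens):
    """Return the maximum `repeat` nesting depth of a valid token list.

    Raises ValueError if the tokens do not conform to the toy grammar
    (a hypothetical stand-in for an in-context CFG).
    """
    pos = 0

    def program(depth):
        nonlocal pos
        deepest = depth
        # A program is a sequence of statements, ended by "}" or end of input.
        while pos < len(tokens) and tokens[pos] != "}":
            tok = tokens[pos]
            if tok in ("move", "left", "right"):
                pos += 1
            elif tok == "repeat":
                # Expect: repeat INT { program }
                if (
                    pos + 2 >= len(tokens)
                    or not tokens[pos + 1].isdigit()
                    or tokens[pos + 2] != "{"
                ):
                    raise ValueError(f"malformed repeat at token {pos}")
                pos += 3
                deepest = max(deepest, program(depth + 1))
                if pos >= len(tokens) or tokens[pos] != "}":
                    raise ValueError("missing closing brace")
                pos += 1
            else:
                raise ValueError(f"unknown token {tok!r} at position {pos}")
        return deepest

    depth = program(0)
    if pos != len(tokens):
        raise ValueError("unmatched closing brace")
    return depth


def is_valid(tokens):
    """Syntax-only check: True if the tokens parse, False otherwise."""
    try:
        parse_depth(tokens)
        return True
    except ValueError:
        return False


nested = "repeat 2 { move repeat 3 { left } }".split()
depth = parse_depth(nested)
```

A check like this covers only the syntax dimension. Behavioral and semantic fidelity would additionally require executing the program and comparing the resulting state against a reference, which is where the abstract reports the sharpest degradation.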

Problem

Research questions and friction points this paper is trying to address.

CFG interpretation
hierarchical state-tracking
semantic alignment
recursion depth
grammar-agnostic agents

Innovation

Methods, ideas, or system contributions that make the work stand out.

context-free grammar
hierarchical state-tracking
semantic bootstrapping
controlled stress-testing
in-context interpretation