RELIC: Evaluating Compositional Instruction Following via Language Recognition

📅 2025-06-05
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the ability of large language models (LLMs) to follow complex compositional instructions, such as recognizing formal languages defined by context-free grammars, in a zero-shot setting using only natural-language task descriptions. We find that state-of-the-art models perform near chance on highly compositional grammar tasks and systematically fall back on shallow heuristic reasoning as syntactic complexity increases. Method: We propose RELIC, a framework built around a scalable, automatically generated synthetic grammar benchmark that enables controlled variation in both grammatical and string-level complexity and requires composing many instructions retrieved from context. Crucially, RELIC grounds instruction-following evaluation in formal language theory, enabling systematic, contamination-free diagnosis. Contribution/Results: Experiments demonstrate that model accuracy is reliably predicted by formal measures of grammar and string complexity. We quantitatively confirm and characterize the degradation of compositional reasoning strategies as more rules must be composed, revealing fundamental limitations in current LLMs' structural generalization.

📝 Abstract
Large language models (LLMs) are increasingly expected to perform tasks based only on a specification of the task provided in context, without examples of inputs and outputs; this ability is referred to as instruction following. We introduce the Recognition of Languages In-Context (RELIC) framework to evaluate instruction following using language recognition: the task of determining if a string is generated by a formal grammar. Unlike many standard evaluations of LLMs' ability to use their context, this task requires composing together a large number of instructions (grammar productions) retrieved from the context. Because the languages are synthetic, the task can be increased in complexity as LLMs' skills improve, and new instances can be automatically generated, mitigating data contamination. We evaluate state-of-the-art LLMs on RELIC and find that their accuracy can be reliably predicted from the complexity of the grammar and the individual example strings, and that even the most advanced LLMs currently available show near-chance performance on more complex grammars and samples, in line with theoretical expectations. We also use RELIC to diagnose how LLMs attempt to solve increasingly difficult reasoning tasks, finding that as the complexity of the language recognition task increases, models switch to relying on shallow heuristics instead of following complex instructions.
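The recognition task the abstract describes, deciding whether a string is generated by a context-free grammar, has a classical exact solution: CYK parsing over a grammar in Chomsky normal form. A minimal sketch follows; the toy grammar (which generates aⁿbⁿ) and the dictionary-based grammar format are illustrative assumptions, not RELIC's actual benchmark grammars.

```python
# Minimal CYK recognizer for a context-free grammar in Chomsky
# normal form. Toy grammar (illustrative, not from RELIC):
#   S -> A X | A B,  X -> S B,  A -> 'a',  B -> 'b'   (language a^n b^n)
CNF_GRAMMAR = {
    "S": [("A", "X"), ("A", "B")],
    "X": [("S", "B")],
    "A": ["a"],   # a bare string RHS is a terminal production
    "B": ["b"],
}

def recognizes(grammar, start, string):
    n = len(string)
    if n == 0:
        return False  # the empty string is not handled in this sketch
    # table[i][l] = set of nonterminals deriving string[i : i + l + 1]
    table = [[set() for _ in range(n)] for _ in range(n)]
    for i, ch in enumerate(string):
        for lhs, rhss in grammar.items():
            if ch in rhss:
                table[i][0].add(lhs)
    for length in range(2, n + 1):           # span length
        for i in range(n - length + 1):      # span start
            for split in range(1, length):   # split point within the span
                left = table[i][split - 1]
                right = table[i + split][length - split - 1]
                for lhs, rhss in grammar.items():
                    for rhs in rhss:
                        if (isinstance(rhs, tuple)
                                and rhs[0] in left and rhs[1] in right):
                            table[i][length - 1].add(lhs)
    return start in table[0][n - 1]
```

A symbolic recognizer like this runs in O(n³·|G|) time; the paper's point is that LLMs asked to do the same job from an in-context grammar description degrade sharply as the grammar and strings grow.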
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to follow compositional instructions via language recognition
Assessing LLMs' performance on synthetic grammar-based tasks to avoid data contamination
Diagnosing LLMs' reliance on heuristics versus complex instruction following
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses synthetic languages for scalable evaluation
Measures LLM instruction following via grammar recognition
Predicts accuracy based on grammar and string complexity
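The synthetic-evaluation idea in the bullets above can be sketched as sampling strings from a context-free grammar by recursively expanding nonterminals. The grammar format, depth cap, and shortest-expansion fallback here are assumptions for illustration, not RELIC's actual generation procedure.

```python
import random

# Toy grammar (illustrative):  S -> A S B | a b,  A -> a,  B -> b
# Nonterminals are keys of the dict; anything else is a terminal.
TOY_GRAMMAR = {
    "S": [["A", "S", "B"], ["a", "b"]],
    "A": [["a"]],
    "B": [["b"]],
}

def sample(grammar, symbol="S", depth=0, max_depth=10, rng=random):
    """Sample one string by random leftmost expansion."""
    if symbol not in grammar:
        return symbol  # terminal: emit as-is
    rhss = grammar[symbol]
    if depth >= max_depth:
        # Past the depth cap, take the shortest right-hand side,
        # which terminates for this toy grammar.
        rhs = min(rhss, key=len)
    else:
        rhs = rng.choice(rhss)
    return "".join(sample(grammar, s, depth + 1, max_depth, rng)
                   for s in rhs)
```

Because instances are drawn from the grammar on demand, fresh positive examples can be generated indefinitely, which is what makes the benchmark scalable and resistant to data contamination.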