Deconstructing Instruction-Following: A New Benchmark for Granular Evaluation of Large Language Model Instruction Compliance Abilities

📅 2026-01-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing benchmarks often conflate instruction following with task success, hindering accurate assessment of large language models’ true compliance capabilities under complex instructions. This work proposes MOSAIC, a modular framework that, for the first time, decomposes instruction compliance into independently analyzable dimensions. By dynamically synthesizing datasets incorporating up to 20 application-oriented constraints, MOSAIC enables fine-grained, disentangled evaluation. Systematic ablation studies, combined with analyses of constraint composition and positional sensitivity across five mainstream models, reveal non-uniform response patterns dependent on constraint type, count, and placement. The study identifies primacy and recency biases alongside model-specific vulnerabilities, offering critical diagnostic insights to guide the development of more reliable language models.

📝 Abstract
Reliably ensuring Large Language Models (LLMs) follow complex instructions is a critical challenge, as existing benchmarks often fail to reflect real-world use or isolate compliance from task success. We introduce MOSAIC (MOdular Synthetic Assessment of Instruction Compliance), a modular framework that uses a dynamically generated dataset with up to 20 application-oriented generation constraints to enable a granular and independent analysis of this capability. Our evaluation of five LLMs from different families based on this new benchmark demonstrates that compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position. The analysis reveals model-specific weaknesses, uncovers synergistic and conflicting interactions between instructions, and identifies distinct positional biases such as primacy and recency effects. These granular insights are critical for diagnosing model failures and developing more reliable LLMs for systems that demand strict adherence to complex instructions.
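The core idea of decomposing compliance into independently analyzable dimensions can be illustrated with a minimal sketch. The constraint names and checks below are illustrative assumptions, not the paper's actual implementation: each constraint is scored on its own, so a response's compliance report is disentangled from overall task success.

```python
# Hypothetical sketch of MOSAIC-style modular compliance checking.
# Constraint names and checks here are illustrative assumptions,
# not the benchmark's actual constraint set.

def max_words(limit):
    """Constraint: response length capped at `limit` words."""
    return lambda text: len(text.split()) <= limit

def must_include(keyword):
    """Constraint: response must mention `keyword` (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()

def no_bullet_points(text):
    """Constraint: response must not use bullet-point formatting."""
    return not any(line.lstrip().startswith(("-", "*")) for line in text.splitlines())

def check_compliance(response, constraints):
    """Score each constraint independently, one pass/fail per dimension."""
    return {name: check(response) for name, check in constraints.items()}

constraints = {
    "max_50_words": max_words(50),
    "mentions_refund": must_include("refund"),
    "no_bullets": no_bullet_points,
}

response = "We have processed your refund and it will arrive within 5 days."
report = check_compliance(response, constraints)
# e.g. {'max_50_words': True, 'mentions_refund': True, 'no_bullets': True}
```

Because each dimension is judged separately, a model that solves the task but violates one formatting constraint shows up as a partial failure rather than a binary miss, which is the kind of granular diagnosis the benchmark targets.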
Problem

Research questions and friction points this paper is trying to address.

instruction-following
large language models
benchmark
instruction compliance
evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

instruction-following
granular evaluation
modular benchmark
constraint compliance
LLM reliability
Alberto Purpura
Capital One
Generative AI, Information Retrieval, Natural Language Processing, Sentiment Analysis
Li Wang
Ant Group
machine learning, MPC
Sahil Badyal
Card Intelligence, Capital One
Eugenio Beaufrand
Card Intelligence, Capital One
Adam Faulkner
Card Intelligence, Capital One