🤖 AI Summary
Existing benchmarks often conflate instruction following with task success, hindering accurate assessment of large language models’ true compliance capabilities under complex instructions. This work proposes MOSAIC, a modular framework that, for the first time, decomposes instruction compliance into independently analyzable dimensions. By dynamically synthesizing datasets that incorporate up to 20 application-oriented constraints, MOSAIC enables fine-grained, disentangled evaluation. Systematic ablation studies, combined with analyses of constraint composition and positional sensitivity across five mainstream models, reveal non-uniform response patterns that depend on constraint type, count, and placement. The study identifies primacy and recency biases alongside model-specific vulnerabilities, offering diagnostic insights to guide the development of more reliable language models.
📝 Abstract
Reliably ensuring that Large Language Models (LLMs) follow complex instructions is a critical challenge, as existing benchmarks often fail to reflect real-world use or to isolate compliance from task success. We introduce MOSAIC (MOdular Synthetic Assessment of Instruction Compliance), a modular framework that uses a dynamically generated dataset with up to 20 application-oriented generation constraints to enable a granular and independent analysis of this capability. On this benchmark, our evaluation of five LLMs from different families demonstrates that compliance is not a monolithic capability but varies significantly with constraint type, quantity, and position. The analysis reveals model-specific weaknesses, uncovers synergistic and conflicting interactions between instructions, and identifies distinct positional biases such as primacy and recency effects. These granular insights are critical for diagnosing model failures and developing more reliable LLMs for systems that demand strict adherence to complex instructions.
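The abstract's central idea, checking each generation constraint independently rather than folding compliance into a single task-success score, can be illustrated with a minimal sketch. The constraint names and checker functions below are illustrative assumptions for exposition, not MOSAIC's actual implementation or API.

```python
# Hypothetical sketch of disentangled, per-constraint compliance checking.
# Each constraint is a small predicate verified independently, so a failure
# can be attributed to a specific constraint type rather than to overall
# task success. All names here are illustrative, not MOSAIC's real API.

def max_words(limit):
    """Constraint: response must contain at most `limit` words."""
    return lambda text: len(text.split()) <= limit

def must_include(keyword):
    """Constraint: response must mention `keyword` (case-insensitive)."""
    return lambda text: keyword.lower() in text.lower()

def ends_with(suffix):
    """Constraint: response must end with the given suffix."""
    return lambda text: text.rstrip().endswith(suffix)

def compliance_report(response, constraints):
    """Return a per-constraint pass/fail map, keeping dimensions separate."""
    return {name: check(response) for name, check in constraints.items()}

constraints = {
    "length<=10_words": max_words(10),
    "mentions_python": must_include("Python"),
    "ends_with_period": ends_with("."),
}
response = "Python is a popular language for scripting."
report = compliance_report(response, constraints)
# Each dimension is judged separately; aggregate scores can then be
# broken down by constraint type, count, or position in the prompt.
```

Such a decomposition is what allows the kind of analysis the abstract describes: varying which constraints appear, how many, and where in the prompt they are placed, then attributing failures to individual dimensions.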