Revisiting the Reliability of Language Models in Instruction-Following

📅 2025-12-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the semantic robustness of large language models (LLMs) in instruction following: whether model outputs remain consistent when user intent is preserved but prompt wording undergoes subtle variation. To this end, we introduce *nuance-oriented reliability*, a reliability notion centered on semantic granularity, and present IFEval++, a benchmark explicitly designed to evaluate performance under fine-grained semantic perturbations. We further propose reliable@k, a metric that measures consistency across semantically equivalent prompts. Methodologically, we design an automated data augmentation framework that generates high-quality “cousin prompts”: paraphrased variants that preserve the underlying intent while introducing lexical and syntactic diversity. Comprehensive evaluation across 20 closed-source and 26 open-source LLMs reveals up to 61.8% performance degradation under minimal prompt modifications. Our analysis further explores three potential intervention strategies to enhance reliability. All code and the IFEval++ benchmark are publicly released to advance research on LLM robustness.
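The page does not reproduce the formal definition of reliable@k. As a minimal sketch only, assuming reliable@k is the fraction of instructions on which a model succeeds for the original prompt and all k cousin variants; the function name, data layout, and example outcomes below are illustrative assumptions, not taken from the paper:

```python
# Hedged sketch: assumes reliable@k = fraction of instruction groups where
# the model passes the original prompt AND all k cousin variants.
# The per-group pass/fail layout is a hypothetical stand-in.

def reliable_at_k(results: list[list[bool]], k: int) -> float:
    """results[i] holds pass/fail outcomes for instruction i:
    index 0 is the original prompt, indices 1..k are cousin prompts."""
    groups = [r for r in results if len(r) >= k + 1]
    if not groups:
        return 0.0
    consistent = sum(all(r[: k + 1]) for r in groups)
    return consistent / len(groups)

# Example: 3 instruction groups, k = 2 cousins each.
outcomes = [
    [True, True, True],    # reliable: passes original and both cousins
    [True, False, True],   # not reliable: fails one cousin
    [True, True, False],   # not reliable
]
print(reliable_at_k(outcomes, k=2))  # 1/3 ≈ 0.333
```

Under this reading, a single failed cousin prompt disqualifies the whole instruction group, which is what makes the metric stricter than per-prompt accuracy.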

📝 Abstract
Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable service in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building on this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models fall substantially short on nuance-oriented reliability: their performance can drop by up to 61.8% under nuanced prompt modifications. We further characterize this failure mode and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are available at https://github.com/jianshuod/IFEval-pp.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' reliability under nuanced prompt variations
Introduces a new metric and benchmark for consistency in instruction following
Identifies performance drops of up to 61.8% caused by subtle phrasing changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the reliable@k metric for nuance consistency
Develops an automated pipeline for cousin prompt generation (illustrated in the sketch below)
Constructs the IFEval++ benchmark for systematic reliability evaluation
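The generation pipeline itself is not detailed on this page. As a rough illustration under stated assumptions, a paraphrase-then-verify loop of the kind described might look like the following; the chat() helper, prompt templates, and acceptance check are hypothetical stand-ins, not the authors' implementation:

```python
# Illustrative paraphrase-then-filter loop for generating "cousin prompts".
# NOT the authors' pipeline: chat(), both templates, and the YES/NO judge
# are hypothetical stand-ins for whatever LLM API and verifier is used.

def chat(prompt: str) -> str:
    # Hypothetical LLM client stub; plug in a real API call here.
    raise NotImplementedError

PARAPHRASE_TMPL = (
    "Rewrite the instruction below with different wording and sentence "
    "structure, but keep the user's intent and every constraint identical.\n\n"
    "{instr}"
)
VERIFY_TMPL = (
    "Do these two instructions ask for exactly the same thing? "
    "Answer YES or NO.\n\nA: {a}\n\nB: {b}"
)

def generate_cousins(instruction: str, k: int, max_tries: int = 20) -> list[str]:
    """Collect up to k intent-preserving paraphrases ("cousin prompts")."""
    cousins: list[str] = []
    for _ in range(max_tries):
        if len(cousins) == k:
            break
        candidate = chat(PARAPHRASE_TMPL.format(instr=instruction)).strip()
        # Keep only variants an LLM judge deems intent-preserving,
        # and drop exact duplicates of prompts already collected.
        verdict = chat(VERIFY_TMPL.format(a=instruction, b=candidate))
        if verdict.strip().upper().startswith("YES") and candidate not in cousins:
            cousins.append(candidate)
    return cousins
```

The key design choice is the verification step: paraphrasing alone can silently drop constraints, so each candidate is re-checked for intent preservation before it joins the cousin set.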
👥 Authors
Jianshuo Dong
Tsinghua University
Trustworthy AI · Explainable AI · Agent Security
Yutong Zhang
Tsinghua University, China
Yan Liu
Ant Group, China
Zhenyu Zhong
Ant Group
Security
Tao Wei
Ant Group, China
Chao Zhang
Tsinghua University, China
Han Qiu
NTU