Revisiting the Reliability of Language Models in Instruction-Following

📅 2025-12-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the semantic robustness of large language models (LLMs) in instruction following: whether model outputs remain consistent when user intent is preserved but prompt wording undergoes subtle variation. To this end, we introduce *nuance-oriented reliability*, a reliability notion centered on semantic granularity, and present IFEval++, a benchmark explicitly designed to evaluate performance under fine-grained semantic perturbations. We further propose reliable@k, a metric that measures consistency across semantically equivalent prompts. Methodologically, we design an automated data augmentation framework that generates high-quality “cousin prompts”: paraphrased variants that preserve the underlying intent while introducing lexical and syntactic diversity. Comprehensive evaluation across 20 closed-source and 26 open-source LLMs reveals up to 61.8% performance degradation under minimal prompt modifications. Our analysis further explores three potential intervention strategies to enhance reliability. All code and the IFEval++ benchmark are publicly released to advance research on LLM robustness.
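The page does not reproduce the formal definition of reliable@k. As a minimal sketch only, assuming reliable@k is the fraction of instructions on which a model succeeds for the original prompt and all k cousin variants; the function name, data layout, and example outcomes below are illustrative assumptions, not taken from the paper:

```python
# Hedged sketch: assumes reliable@k = fraction of instruction groups where
# the model passes the original prompt AND all k cousin variants.
# The per-group pass/fail layout is a hypothetical stand-in.

def reliable_at_k(results: list[list[bool]], k: int) -> float:
    """results[i] holds pass/fail outcomes for instruction i:
    index 0 is the original prompt, indices 1..k are cousin prompts."""
    groups = [r for r in results if len(r) >= k + 1]
    if not groups:
        return 0.0
    consistent = sum(all(r[: k + 1]) for r in groups)
    return consistent / len(groups)

# Example: 3 instruction groups, k = 2 cousins each.
outcomes = [
    [True, True, True],    # reliable: passes original and both cousins
    [True, False, True],   # not reliable: fails one cousin
    [True, True, False],   # not reliable
]
print(reliable_at_k(outcomes, k=2))  # 1/3 ≈ 0.333
```

Under this reading, a single failed cousin prompt disqualifies the whole instruction group, which is what makes the metric stricter than per-prompt accuracy.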

📝 Abstract
Advanced LLMs have achieved near-ceiling instruction-following accuracy on benchmarks such as IFEval. However, these impressive scores do not necessarily translate to reliable service in real-world use, where users often vary their phrasing, contextual framing, and task formulations. In this paper, we study nuance-oriented reliability: whether models exhibit consistent competence across cousin prompts that convey analogous user intents with subtle nuances. To quantify this, we introduce a new metric, reliable@k, and develop an automated pipeline that generates high-quality cousin prompts via data augmentation. Building on this, we construct IFEval++ for systematic evaluation. Across 20 proprietary and 26 open-source LLMs, we find that current models fall substantially short on nuance-oriented reliability: their performance can drop by up to 61.8% under nuanced prompt modifications. We further characterize this failure mode and explore three potential improvement recipes. Our findings highlight nuance-oriented reliability as a crucial yet underexplored next step toward more dependable and trustworthy LLM behavior. Our code and benchmark are available at https://github.com/jianshuod/IFEval-pp.
Problem

Research questions and friction points this paper is trying to address.

Evaluates LLMs' reliability under nuanced prompt variations
Introduces a new metric and benchmark for consistency in instruction following
Identifies performance drops of up to 61.8% caused by subtle phrasing changes
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces the reliable@k metric for nuance consistency
Develops an automated pipeline for cousin prompt generation (illustrated in the sketch below)
Constructs the IFEval++ benchmark for systematic reliability evaluation
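The generation pipeline itself is not detailed on this page. As a rough illustration under stated assumptions, a paraphrase-then-verify loop of the kind described might look like the following; the chat() helper, prompt templates, and acceptance check are hypothetical stand-ins, not the authors' implementation:

```python
# Illustrative paraphrase-then-filter loop for generating "cousin prompts".
# NOT the authors' pipeline: chat(), both templates, and the YES/NO judge
# are hypothetical stand-ins for whatever LLM API and verifier is used.

def chat(prompt: str) -> str:
    # Hypothetical LLM client stub; plug in a real API call here.
    raise NotImplementedError

PARAPHRASE_TMPL = (
    "Rewrite the instruction below with different wording and sentence "
    "structure, but keep the user's intent and every constraint identical.\n\n"
    "{instr}"
)
VERIFY_TMPL = (
    "Do these two instructions ask for exactly the same thing? "
    "Answer YES or NO.\n\nA: {a}\n\nB: {b}"
)

def generate_cousins(instruction: str, k: int, max_tries: int = 20) -> list[str]:
    """Collect up to k intent-preserving paraphrases ("cousin prompts")."""
    cousins: list[str] = []
    for _ in range(max_tries):
        if len(cousins) == k:
            break
        candidate = chat(PARAPHRASE_TMPL.format(instr=instruction)).strip()
        # Keep only variants an LLM judge deems intent-preserving,
        # and drop exact duplicates of prompts already collected.
        verdict = chat(VERIFY_TMPL.format(a=instruction, b=candidate))
        if verdict.strip().upper().startswith("YES") and candidate not in cousins:
            cousins.append(candidate)
    return cousins
```

The key design choice is the verification step: paraphrasing alone can silently drop constraints, so each candidate is re-checked for intent preservation before it joins the cousin set.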
👥 Authors
Jianshuo Dong
Tsinghua University
Trustworthy AI · Explainable AI · Agent Security
Yutong Zhang
Tsinghua University, China
Yan Liu
Ant Group, China
Zhenyu Zhong
Ant Group
Security
Tao Wei
Ant Group, China
Chao Zhang
Tsinghua University, China
Han Qiu
NTU