Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

📅 2025-09-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Large language models (LLMs) suffer from “cognitive inertia”: a critical alignment bottleneck in which they struggle to comply with counterintuitive instructions that contradict patterns learned during supervised fine-tuning. Method: We propose Inverse IFEval, the first systematic benchmark for evaluating instruction-following flexibility, comprising 1,012 high-quality Chinese and English samples spanning 23 domains and eight reverse-instruction challenge categories (e.g., Question Correction, Intentional Textual Flaws). It employs human-AI collaborative annotation and an optimized LLM-as-a-Judge evaluation paradigm for cross-lingual, fine-grained assessment. Contribution/Results: Experiments reveal substantial performance degradation of mainstream LLMs on reverse instructions, which is consistent with cognitive inertia being widespread. This work introduces “instruction-following flexibility” as an alignment dimension that complements fluency and factual correctness, and it provides empirical evidence to guide alignment methods that adapt robustly to unconventional scenarios.

📝 Abstract

Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' Counter-intuitive Ability: their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1,012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on existing leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.

Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to override training-induced biases

Measuring capacity to follow adversarial counter-intuitive instructions

Assessing adaptability under unconventional contextual scenarios

Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Inverse IFEval benchmark for counter-intuitive instruction evaluation

Uses human-in-the-loop pipeline to create Chinese and English adversarial questions

Implements optimized LLM-as-a-Judge framework for automated assessment
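
As a rough illustration of how judge verdicts on such a benchmark could be aggregated, the sketch below computes a pass rate per challenge category. It is not the paper's implementation: the `Sample` fields, the `Judge` signature, and `toy_judge` (a string check standing in for a real LLM judge) are all hypothetical names introduced here, and the paper's actual judging prompts and scoring rules are not described on this page.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Dict, Iterable


@dataclass(frozen=True)
class Sample:
    """One benchmark item plus the model output being evaluated (hypothetical schema)."""
    category: str     # e.g. "Code without Comments"
    language: str     # "zh" or "en"
    instruction: str  # the counter-intuitive instruction given to the model
    response: str     # the model's answer


# Hypothetical judge interface: True means the response obeyed the instruction.
Judge = Callable[[Sample], bool]


def accuracy_by_category(samples: Iterable[Sample], judge: Judge) -> Dict[str, float]:
    """Return the fraction of judged-compliant responses for each category."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for sample in samples:
        total[sample.category] += 1
        if judge(sample):
            passed[sample.category] += 1
    return {category: passed[category] / total[category] for category in total}


def toy_judge(sample: Sample) -> bool:
    """Stand-in for an LLM judge: only knows a naive 'no comments' rule."""
    if sample.category == "Code without Comments":
        return "#" not in sample.response
    return False  # assumption: unknown categories are treated as failures


if __name__ == "__main__":
    demo = [
        Sample("Code without Comments", "en",
               "Write add() with no comments.",
               "def add(a, b):\n    return a + b"),
        Sample("Code without Comments", "en",
               "Write add() with no comments.",
               "def add(a, b):  # sum two numbers\n    return a + b"),
    ]
    print(accuracy_by_category(demo, toy_judge))
```

In practice `toy_judge` would be replaced by a call to a judge model; keeping the judge behind a plain callable means the aggregation does not depend on which model or prompt is used.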
