Inverse IFEval: Can LLMs Unlearn Stubborn Training Conventions to Follow Real Instructions?

📅 2025-09-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) suffer from “cognitive inertia”—a critical alignment bottleneck wherein they struggle to comply with counterintuitive instructions that contradict patterns learned during supervised fine-tuning. Method: We propose Inverse IFEval, the first systematic benchmark for evaluating instruction-following flexibility, comprising 1,012 high-quality, multilingual (Chinese/English), multi-domain samples across eight reverse-instruction challenge categories (e.g., question correction, deliberate text errors). It employs human-AI collaborative annotation and an optimized LLM-as-a-Judge evaluation paradigm for cross-lingual, fine-grained assessment. Contribution/Results: Experiments reveal substantial performance degradation of mainstream LLMs on reverse instructions, empirically confirming the ubiquity of cognitive inertia. This work formally introduces “instruction-following flexibility” as a novel alignment dimension—complementing fluency and factual consistency—and provides both theoretical grounding and empirical evidence to guide next-generation alignment methods capable of robust adaptation to unconventional scenarios.

📝 Abstract
Large Language Models (LLMs) achieve strong performance on diverse tasks but often exhibit cognitive inertia, struggling to follow instructions that conflict with the standardized patterns learned during supervised fine-tuning (SFT). To evaluate this limitation, we propose Inverse IFEval, a benchmark that measures models' counter-intuitive ability: their capacity to override training-induced biases and comply with adversarial instructions. Inverse IFEval introduces eight types of such challenges, including Question Correction, Intentional Textual Flaws, Code without Comments, and Counterfactual Answering. Using a human-in-the-loop pipeline, we construct a dataset of 1,012 high-quality Chinese and English questions across 23 domains, evaluated under an optimized LLM-as-a-Judge framework. Experiments on leading LLMs demonstrate the necessity of our proposed Inverse IFEval benchmark. Our findings emphasize that future alignment efforts should not only pursue fluency and factual correctness but also account for adaptability under unconventional contexts. We hope that Inverse IFEval serves as both a diagnostic tool and a foundation for developing methods that mitigate cognitive inertia, reduce overfitting to narrow patterns, and ultimately enhance the instruction-following reliability of LLMs in diverse and unpredictable real-world scenarios.
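To make the challenge types concrete, here is a minimal sketch of what a single benchmark item and its rule-based check might look like for the "Code without Comments" category. The field names and the checker are illustrative assumptions, not the paper's actual data schema or evaluation pipeline:

```python
# Hypothetical sketch of an Inverse IFEval-style item; field names are
# assumptions for illustration and may differ from the released dataset.
item = {
    "id": "inv-0001",
    "category": "Code without Comments",  # one of the eight challenge types
    "language": "en",                     # dataset covers Chinese and English
    "instruction": (
        "Write a Python function that computes a factorial. "
        "Do NOT include any comments or docstrings."
    ),
}

def violates_no_comment_rule(code: str) -> bool:
    # Flag responses that break the counter-intuitive "no comments"
    # instruction: any '#' comment line or a docstring delimiter.
    return any(
        line.strip().startswith("#") or '"""' in line
        for line in code.splitlines()
    )
```

A model trained to always comment its code will often fail this check, which is exactly the cognitive inertia the benchmark probes.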
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to override training-induced biases
Measuring capacity to follow adversarial counter-intuitive instructions
Assessing adaptability under unconventional contextual scenarios
Innovation

Methods, ideas, or system contributions that make the work stand out.

Proposes Inverse IFEval benchmark for counter-intuitive instruction evaluation
Uses human-in-the-loop pipeline creating multilingual adversarial questions
Implements optimized LLM-as-a-Judge framework for automated assessment
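The LLM-as-a-Judge idea in the last bullet can be sketched as a simple compliance-grading loop. This assumes a generic `judge(prompt) -> str` callable; the paper's optimized judging pipeline (rubric design, cross-lingual handling, fine-grained scoring) is more elaborate than this minimal version:

```python
# Minimal LLM-as-a-Judge sketch. `judge` is any callable that takes a
# prompt string and returns the judge model's text reply (an assumption
# standing in for a real model API call).
def judge_response(judge, instruction: str, response: str) -> bool:
    prompt = (
        "You are grading instruction compliance.\n"
        f"Instruction: {instruction}\n"
        f"Response: {response}\n"
        "Did the response follow the instruction, even if the instruction "
        "is counter-intuitive? Answer YES or NO."
    )
    # Treat any reply starting with YES as a pass.
    return judge(prompt).strip().upper().startswith("YES")

def accuracy(judge, items, responses):
    # Fraction of model responses the judge deems compliant.
    verdicts = [
        judge_response(judge, it["instruction"], r)
        for it, r in zip(items, responses)
    ]
    return sum(verdicts) / len(verdicts)
```

The binary YES/NO verdict keeps parsing trivial; a production judge would typically also request a short rationale to audit its decisions.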
Authors
Qinyan Zhang (ByteDance)
Xinping Lei (ByteDance)
Ruijie Miao (Peking University)
Yu Fu (ByteDance)
Haojie Fan (ByteDance)
Le Chang (ByteDance)
Jiafan Hou (ByteDance)
Dingling Zhang (ByteDance)
Zhongfei Hou (ByteDance)
Ziqiang Yang (ByteDance)
Changxin Pu (ByteDance)
Fei Hu (ByteDance)
Jingkai Liu (Nanjing University)
Mengyun Liu (Nanjing University)
Yang Liu (Nanjing University)
Xiang Gao (Nanjing University)
Jiaheng Liu (Nanjing University)
Tong Yang (Peking University)
Zaiyuan Wang (ByteDance)
Ge Zhang (Beijing University of Posts and Telecommunications)
Wenhao Huang (Beijing University of Posts and Telecommunications)