🤖 AI Summary
This study investigates the imbalance between instruction-following and critical refusal capabilities of large language models (LLMs) when confronted with invalid multiple-choice questions (e.g., all options are incorrect). We propose a novel evaluation metric—“reflective judgment”—and conduct cross-model experiments using GPT-4o, Claude 3 Opus, Llama 3.1, and Qwen2.5, augmented by human validation and bias analysis of RLHF data. Results show that alignment training significantly impairs models’ ability to reject erroneous instructions, whereas base models exhibit improved refusal capacity with increasing parameter count (size-dependent robustness). Leading closed-source models consistently comply blindly, while certain open-source models (e.g., Llama 3.1, Qwen2.5) demonstrate scalable reflective reasoning. We further provide the first systematic evidence that pretraining and alignment strategies exert opposing effects on reflective capability, and identify the absence of reflective behavior in RLHF data as a potential source of alignment contamination.
📝 Abstract
Decision-making under full alignment requires balancing reasoning against faithfulness, a challenge for large language models (LLMs). This study explores whether LLMs prioritize following instructions over reasoning and truth when given "misleading" instructions, such as "Respond solely with A or B", even when neither option is correct. We introduce a new metric called "reflective judgment", which sheds new light on the relationship between pre-training and post-training alignment schemes. In tasks ranging from basic arithmetic to domain-specific assessments, models like GPT-4o, o1-mini, and Claude 3 Opus adhered to instructions correctly but failed to reflect on the validity of the provided options. In contrast, models from the Llama 3.1 family (8B, 70B, 405B) and the base Qwen2.5 family (7B, 14B, 32B) exhibit refusal rates that improve with size, indicating a scaling effect. We also observed that alignment techniques, though intended to enhance reasoning, sometimes weakened the models' ability to reject incorrect instructions, leading them to follow flawed prompts uncritically. Finally, a parallel human study revealed similar patterns in human behavior and annotations. We highlight how popular RLHF datasets might disrupt either training or evaluation, as their annotations exhibit poor reflective judgment.
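To make the evaluation concrete, here is a minimal sketch of how an invalid multiple-choice item and a refusal-based "reflective judgment" score might be constructed. The prompt wording, the regex-based compliance check, and all function names are our own illustrative assumptions, not the paper's exact protocol:

```python
import re

def make_invalid_item(question="What is 7 + 5?", options=("A) 10", "B) 13")):
    # Both options are deliberately wrong (the correct answer is 12),
    # mirroring the "misleading" instruction setup described above.
    return (f"{question}\n" + "\n".join(options) +
            "\nRespond solely with A or B.")

def classify_response(text):
    # Label a model reply as blind compliance (it picks A or B)
    # or a reflective refusal (it flags that neither option is correct).
    if re.fullmatch(r"[AB]\b.*", text.strip(), flags=re.DOTALL):
        return "comply"
    return "refuse"

def reflective_judgment_rate(responses):
    # Fraction of replies to invalid items that refuse rather than comply;
    # higher values indicate stronger reflective judgment.
    refusals = sum(1 for r in responses if classify_response(r) == "refuse")
    return refusals / len(responses)
```

In practice the compliance check would need to be more robust (models often pad answers with explanations), and the paper reports human validation of such labels; this stub only fixes the shape of the metric.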