Wait, that's not an option: LLMs Robustness with Incorrect Multiple-Choice Options

📅 2024-08-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates the imbalance between instruction-following and critical refusal capabilities of large language models (LLMs) when confronted with invalid multiple-choice questions (e.g., all options are incorrect). We propose a novel evaluation metric—“reflective judgment”—and conduct cross-model experiments using GPT-4o, Claude 3 Opus, Llama 3.1, and Qwen2.5, augmented by human validation and bias analysis of RLHF data. Results show that alignment training significantly impairs models’ ability to reject erroneous instructions, whereas base models exhibit improved refusal capacity with increasing parameter count (size-dependent robustness). Leading closed-source models consistently comply blindly, while certain open-source models (e.g., Llama 3.1, Qwen2.5) demonstrate scalable reflective reasoning. We further provide the first systematic evidence that pretraining and alignment strategies exert opposing effects on reflective capability, and identify the absence of reflective behavior in RLHF data as a potential source of alignment contamination.

📝 Abstract
Decision-making under full alignment requires balancing between reasoning and faithfulness - a challenge for large language models (LLMs). This study explores whether LLMs prioritize following instructions over reasoning and truth when given "misleading" instructions, such as "Respond solely with A or B", even when neither option is correct. We introduce a new metric called "reflective judgment", which sheds new light on the relationship between pre-training and post-training alignment schemes. In tasks ranging from basic arithmetic to domain-specific assessments, models like GPT-4o, o1-mini, or Claude 3 Opus adhered to instructions correctly but failed to reflect on the validity of the provided options. In contrast, models from the Llama 3.1 family (8B, 70B, 405B) or the base Qwen2.5 family (7B, 14B, 32B) exhibit improved refusal rates with size, indicating a scaling effect. We also observed that alignment techniques, though intended to enhance reasoning, sometimes weakened the models' ability to reject incorrect instructions, leading them to follow flawed prompts uncritically. Finally, we conducted a parallel human study revealing similar patterns in human behavior and annotations. We highlight how popular RLHF datasets might disrupt either training or evaluation due to annotations exhibiting poor reflective judgment.
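The evaluation setup the abstract describes can be sketched in a few lines: build an arithmetic multiple-choice question in which every listed option is wrong, instruct the model to "Respond solely with A or B", and classify its reply as blind compliance or a reflective refusal. This is a minimal illustrative sketch, not the paper's implementation; the function names and the keyword heuristic for detecting refusals are assumptions.

```python
# Hypothetical sketch of an invalid-MCQ probe for "reflective judgment".
# All names and the refusal-cue heuristic are illustrative assumptions.

def make_invalid_mcq(a: int, b: int) -> str:
    """Build an arithmetic question where neither option is correct."""
    correct = a + b
    return (
        f"What is {a} + {b}?\n"
        f"A) {correct + 1}\n"
        f"B) {correct + 2}\n"
        "Respond solely with A or B."
    )

def judge_reply(reply: str) -> str:
    """Label a model reply: 'compliant' if it simply picks A or B,
    'reflective' if it flags that no listed option is correct."""
    text = reply.strip().lower()
    refusal_cues = ("neither", "none of", "no correct", "not correct")
    if any(cue in text for cue in refusal_cues):
        return "reflective"
    if text in ("a", "b"):
        return "compliant"
    return "other"

prompt = make_invalid_mcq(2, 2)  # lists 5 and 6 as options; 4 is absent
print(judge_reply("A"))                          # compliant
print(judge_reply("Neither option is correct"))  # reflective
```

A real evaluation would send the prompt to each model under test and aggregate refusal rates across model sizes, which is how the scaling effect reported above would surface.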
Problem

Research questions and friction points this paper is trying to address.

Evaluating LLMs' ability to handle invalid multiple-choice options
Assessing alignment techniques' impact on critical reasoning
Exploring human-like biases in instruction-following models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Framework evaluates LLMs with invalid multiple-choice options
Alignment techniques impair reflective judgment in models
Human study reveals similar instruction-following biases
Gracjan Góral
IDEAS NCBR, University of Warsaw
Emilia Wiśnios
IDEAS NCBR, University of Warsaw
Piotr Sankowski
IDEAS NCBR, MIM Solutions, University of Warsaw
Pawel Budzianowski
K-Scale Labs, University of Warsaw