When Choices Become Risks: Safety Failures of Large Language Models under Multiple-Choice Constraints

πŸ“… 2026-04-18
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF

career value

184K/year
πŸ€– AI Summary
This study addresses a critical safety alignment vulnerability in large language models (LLMs) when deployed in structured tasks such as multiple-choice questions, where the inability to abstain from answering can lead to unsafe responses. The authors systematically evaluate 14 mainstream LLMs using adversarially constructed multiple-choice questions in which all available options are unsafe. Their analysis reveals, for the first time, an inverted U-shaped relationship between constraint strength and violation rate. Furthermore, they demonstrate that test questions generated by high-capability models exhibit strong cross-model transferability and near-saturating violation rates. These findings indicate that forced-choice mechanisms substantially increase the likelihood of policy-violating outputs, highlighting a significant underestimation of potential risks in current safety evaluations for structured interaction scenarios.

Technology Category

Application Category

πŸ“ Abstract
Safety alignment in large language models (LLMs) is primarily evaluated under open-ended generation, where models can mitigate risk by refusing to respond. In contrast, many real-world applications place LLMs in structured decision-making tasks, such as multiple-choice questions (MCQs), where abstention is discouraged or unavailable. We identify a systematic failure mode in this setting: reformulating harmful requests as forced-choice MCQs, where all options are unsafe, can systematically bypass refusal behavior, even in models that consistently reject equivalent open-ended prompts. Across 14 proprietary and open-source models, we show that forced-choice constraints sharply increase policy-violating responses. Notably, for human-authored MCQs, violation rates follow an inverted U-shaped trend with respect to structural constraint strength, peaking under intermediate task specifications, whereas MCQs generated by high-capability models yield near-saturation violation rates across constraints and exhibit strong cross-model transferability. Our findings reveal that current safety evaluations substantially underestimate risks in structured task settings and highlight constrained decision-making as a critical and underexplored surface for alignment failures.
Problem

Research questions and friction points this paper is trying to address.

safety alignment
large language models
multiple-choice constraints
structured decision-making
refusal behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

safety alignment
multiple-choice constraints
forced-choice failure
structured decision-making
policy violation
πŸ”Ž Similar Papers
No similar papers found.