Are Aligned Large Language Models Still Misaligned?

📅 2026-02-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Current evaluations of large language model alignment are predominantly confined to isolated dimensions such as safety, values, or culture, failing to capture the intricate, multidimensional alignment demands of real-world scenarios. This work proposes Mis-Align Bench, the first unified evaluation framework integrating safety, values, and culture, along with a high-quality SAVACU dataset. The authors employ a two-stage rejection sampling strategy to generate paired aligned and misaligned responses. Experimental results reveal that state-of-the-art models achieve alignment scores of only 63%–66% under joint multidimensional criteria, with failure rates exceeding 50%, thereby exposing the significant limitations of single-dimension alignment approaches in complex, realistic settings.
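The summary above mentions a two-stage rejection sampling strategy for generating paired aligned and misaligned responses. The paper's exact filters are not given here, so the following is a minimal sketch under assumed criteria: `stage1_ok` stands in for a cheap acceptability check and `stage2_score` for a quality scorer (e.g. a judge model); both names are hypothetical placeholders, not the authors' implementation.

```python
def two_stage_rejection_sample(prompt, generate, stage1_ok, stage2_score,
                               n_candidates=8, min_score=0.7):
    """Keep only candidate responses that pass both filtering stages."""
    # Stage 1: cheap acceptability filter (e.g. format or length checks)
    # applied to all sampled candidates.
    candidates = [generate(prompt) for _ in range(n_candidates)]
    survivors = [c for c in candidates if stage1_ok(c)]
    # Stage 2: a scorer rates each survivor; only responses above the
    # quality bar are accepted into the paired dataset.
    return [c for c in survivors if stage2_score(prompt, c) >= min_score]
```

In this pattern the expensive scorer only runs on candidates that already passed the cheap first-stage check, which is the usual motivation for staging the rejection.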

📝 Abstract
Misalignment in Large Language Models (LLMs) arises when model behavior diverges from human expectations and fails to simultaneously satisfy the safety, value, and cultural dimensions that must co-occur when solving a real-world query. Existing misalignment benchmarks, such as INSECURE CODE (safety-centric), VALUEACTIONLENS (value-centric), and CULTURALHERITAGE (culture-centric), evaluate misalignment along individual dimensions, preventing simultaneous evaluation. To address this gap, we introduce Mis-Align Bench, a unified benchmark for analyzing misalignment across the safety, value, and cultural dimensions. First, we construct SAVACU, an English misaligned-aligned dataset of 382,424 samples spanning 112 domains (or labels): we reclassify prompts from the LLM-PROMPT-DATASET into a taxonomy of 14 safety domains, 56 value domains, and 42 cultural domains using Mistral-7B-Instruct-v0.3, and expand low-resource domains with Llama-3.1-8B-Instruct, using SimHash fingerprints for deduplication. We then pair each prompt with a misaligned and an aligned response via two-stage rejection sampling to enforce quality. Second, we benchmark general-purpose, fine-tuned, and open-weight LLMs, enabling systematic evaluation of misalignment under all three dimensions. Empirically, single-dimension models achieve high Coverage (up to 97.6%) but incur a False Failure Rate above 50% and a lower Alignment Score (63%–66%) under joint conditions.
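The abstract states that low-resource domains are expanded with generated prompts and that SimHash fingerprints are used for deduplication, but gives no implementation details. A minimal sketch of the standard SimHash technique follows; the tokenization, bit width, and Hamming-distance threshold are assumptions for illustration, not values from the paper.

```python
import hashlib

def simhash(text, bits=64):
    # Token-level SimHash: each token's hash votes on every bit position,
    # so near-identical texts get fingerprints with small Hamming distance.
    votes = [0] * bits
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        for i in range(bits):
            votes[i] += 1 if (h >> i) & 1 else -1
    return sum(1 << i for i in range(bits) if votes[i] > 0)

def hamming(a, b):
    return bin(a ^ b).count("1")

def dedup(prompts, threshold=3):
    # Keep a prompt only if its fingerprint differs from every kept
    # fingerprint by more than `threshold` bits.
    kept, fingerprints = [], []
    for p in prompts:
        fp = simhash(p)
        if all(hamming(fp, f) > threshold for f in fingerprints):
            kept.append(p)
            fingerprints.append(fp)
    return kept
```

The appeal of SimHash here is that near-duplicate prompts can be rejected with a cheap bitwise comparison rather than a pairwise text comparison across the full candidate pool.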
Problem

Research questions and friction points this paper is trying to address.

misalignment
large language models
safety
values
culture
Innovation

Methods, ideas, or system contributions that make the work stand out.

multidimensional alignment
unified benchmark
SAVACU dataset
rejection sampling
large language models