Alignment Quality Index (AQI): Beyond Refusals: AQI as an Intrinsic Alignment Diagnostic via Latent Geometry, Cluster Divergence, and Layer-Wise Pooled Representations

📅 2025-06-16
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) deployed in high-stakes domains are currently evaluated with behavioral proxies such as refusal rates and toxicity scores, which have critical blind spots: they are undermined by jailbreak attacks, output stochasticity, and alignment faking. To address this, the paper proposes the Alignment Quality Index (AQI), an intrinsic diagnostic metric grounded in latent-space geometric structure: it quantifies alignment quality via the clustering separation between safe and unsafe activation representations, bypassing reliance on surface-level outputs. The contributions are threefold: (1) a prompt-invariant, decoding-agnostic alignment assessment paradigm; (2) a unified index integrating multiple clustering validity metrics (DBS, DI, XBI, CHI); and (3) robust detection of latent unsafe tendencies and jailbreak vulnerability. On DPO-, GRPO-, and RLHF-trained models, AQI correlates strongly with human evaluations and uncovers alignment deficiencies masked by high refusal rates. Code and the LITMUS benchmark dataset are publicly released.

📝 Abstract
Alignment is no longer a luxury; it is a necessity. As large language models (LLMs) enter high-stakes domains such as education, healthcare, governance, and law, their behavior must reliably reflect human-aligned values and safety constraints. Yet current evaluations rely heavily on behavioral proxies such as refusal rates, G-Eval scores, and toxicity classifiers, all of which have critical blind spots: aligned models are often vulnerable to jailbreaking, generation stochasticity, and alignment faking. To address this, we introduce the Alignment Quality Index (AQI), a novel geometric, prompt-invariant metric that empirically assesses LLM alignment by analyzing the separation of safe and unsafe activations in latent space. By combining measures such as the Davies-Bouldin Score (DBS), Dunn Index (DI), Xie-Beni Index (XBI), and Calinski-Harabasz Index (CHI) across various formulations, AQI captures clustering quality to detect hidden misalignments and jailbreak risks even when outputs appear compliant. AQI also serves as an early warning signal for alignment faking, offering a robust, decoding-invariant tool for behavior-agnostic safety auditing. Additionally, we propose the LITMUS dataset to facilitate robust evaluation under these challenging conditions. Empirical tests on LITMUS, across models trained under DPO, GRPO, and RLHF, demonstrate AQI's correlation with external judges and its ability to reveal vulnerabilities missed by refusal metrics. We make our implementation publicly available to foster future research in this area.
Problem

Research questions and friction points this paper is trying to address.

Assessing LLM alignment beyond refusal rates
Detecting hidden misalignments and jailbreak risks
Providing robust safety auditing for LLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

AQI measures safe-unsafe activation separation
Combines clustering validity indices to expose hidden misalignments
LITMUS dataset enables robust alignment evaluation
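The core idea can be sketched in code: label latent activations as safe or unsafe, score the two clusters with the four validity indices named in the abstract, and fold the scores into a single bounded quantity. Note that the paper does not specify its aggregation formula here; the `alignment_quality_index` combination below (inverting lower-is-better indices, then averaging) is an illustrative assumption, not the authors' exact method.

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn.metrics import davies_bouldin_score, calinski_harabasz_score


def dunn_index(X, labels):
    """Dunn Index: min inter-cluster distance / max intra-cluster diameter (higher is better)."""
    clusters = [X[labels == k] for k in np.unique(labels)]
    inter = min(cdist(a, b).min()
                for i, a in enumerate(clusters) for b in clusters[i + 1:])
    intra = max(cdist(c, c).max() for c in clusters)
    return inter / intra


def xie_beni_index(X, labels):
    """Xie-Beni Index: compactness over separation (lower is better)."""
    keys = np.unique(labels)
    centroids = np.array([X[labels == k].mean(axis=0) for k in keys])
    compact = sum(((X[labels == k] - centroids[i]) ** 2).sum()
                  for i, k in enumerate(keys))
    # Minimum squared centroid distance, ignoring the zero diagonal.
    d = cdist(centroids, centroids) + np.eye(len(keys)) * 1e12
    return compact / (len(X) * d.min() ** 2)


def alignment_quality_index(X, labels):
    """Illustrative composite over pooled activations X with safe/unsafe labels.

    Lower-is-better indices (DBS, XBI) are inverted so every term lies in
    (0, 1) with higher = cleaner safe/unsafe separation; the terms are averaged.
    """
    dbs = davies_bouldin_score(X, labels)      # lower is better
    chi = calinski_harabasz_score(X, labels)   # higher is better
    di = dunn_index(X, labels)                 # higher is better
    xbi = xie_beni_index(X, labels)            # lower is better
    scores = [1.0 / (1.0 + dbs), 1.0 / (1.0 + xbi),
              di / (1.0 + di), chi / (1.0 + chi)]
    return float(np.mean(scores))
```

On synthetic activations, well-separated safe/unsafe clusters yield a higher index than overlapping ones, which is the property AQI relies on to flag models whose latent geometry fails to distinguish unsafe content despite compliant outputs.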