🤖 AI Summary
This work addresses the challenge of reliably evaluating latent alignment in language models: their sensitivity to harmful vs. safe statements and the stability of their internal representations under polarity inversion. To this end, we propose Polarity-Aware CCS (PA-CCS), an unsupervised probing method that combines Contrast-Consistent Search (CCS) with systematic polarity reversal. We introduce two novel metrics, Polar-Consistency and the Contradiction Index, and construct a benchmark of semantically matched harmful-safe sentence pairs. Experiments on 16 mainstream language models show that PA-CCS reveals both architectural and layer-specific differences in how latent harmful knowledge is encoded. Notably, replacing the negation token with a meaningless marker significantly degrades consistency for well-aligned models, indicating that their internal representations genuinely track negation semantics rather than surface patterns. PA-CCS thus offers an unsupervised, interpretable paradigm for alignment evaluation.
📝 Abstract
Advances in unsupervised probes such as Contrast-Consistent Search (CCS), which reveal latent beliefs without relying on token outputs, raise the question of whether these methods can reliably assess model alignment. We investigate this by examining the sensitivity of CCS to harmful vs. safe statements and by introducing Polarity-Aware CCS (PA-CCS), a method for evaluating whether a model's internal representations remain consistent under polarity inversion. We propose two alignment-oriented metrics, Polar-Consistency and the Contradiction Index, to quantify the semantic robustness of a model's latent knowledge. To validate PA-CCS, we curate two main datasets and one control dataset of matched harmful-safe sentence pairs, constructed using two different methodologies (concurrent and antagonistic statements). We apply PA-CCS to 16 language models. Our results show that PA-CCS identifies both architectural and layer-specific differences in the encoding of latent harmful knowledge. Notably, replacing the negation token with a meaningless marker degrades PA-CCS scores for models with well-aligned internal representations, while models lacking robust internal calibration do not exhibit this degradation. Our findings highlight the potential of unsupervised probing for alignment evaluation and emphasize the need to incorporate structural robustness checks into interpretability benchmarks. Code and datasets are available at: https://github.com/SadSabrina/polarity-probing. WARNING: This paper contains potentially sensitive, harmful, and offensive content.
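To make the setup concrete, below is a minimal sketch of a CCS-style probe with a polarity-inversion check. The training objective follows the published CCS formulation (a consistency term plus a confidence term); everything else is an assumption for illustration: the `polarity_scores` function and its two quantities are plausible placeholders for Polar-Consistency and the Contradiction Index, not the paper's exact definitions, and hidden states are assumed to be precomputed from a chosen model layer.

```python
# Illustrative sketch only. The CCS loss matches Burns et al. (2022);
# the Polar-Consistency / Contradiction Index formulas below are
# hypothetical stand-ins, not the definitions used in the paper.
import torch
import torch.nn as nn


class CCSProbe(nn.Module):
    """Linear probe mapping a hidden state to a 'truth' probability."""

    def __init__(self, dim: int):
        super().__init__()
        self.linear = nn.Linear(dim, 1)

    def forward(self, h: torch.Tensor) -> torch.Tensor:
        return torch.sigmoid(self.linear(h)).squeeze(-1)


def ccs_loss(p_pos: torch.Tensor, p_neg: torch.Tensor) -> torch.Tensor:
    # Consistency: a statement and its negation should receive
    # complementary probabilities, p(x+) ~= 1 - p(x-).
    consistency = (p_pos - (1.0 - p_neg)) ** 2
    # Confidence: penalize the degenerate solution p = 0.5 everywhere.
    confidence = torch.minimum(p_pos, p_neg) ** 2
    return (consistency + confidence).mean()


def fit_probe(h_pos: torch.Tensor, h_neg: torch.Tensor,
              steps: int = 1000, lr: float = 1e-3) -> CCSProbe:
    """Fit an unsupervised CCS probe on contrast-pair hidden states."""
    probe = CCSProbe(h_pos.shape[-1])
    opt = torch.optim.Adam(probe.parameters(), lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = ccs_loss(probe(h_pos), probe(h_neg))
        loss.backward()
        opt.step()
    return probe


@torch.no_grad()
def polarity_scores(probe: CCSProbe, h_orig: torch.Tensor,
                    h_inverted: torch.Tensor, tau: float = 0.5):
    """Hypothetical metrics: agreement under polarity inversion, and
    the rate at which both a statement and its inversion score 'true'."""
    p = probe(h_orig)
    p_inv = probe(h_inverted)
    polar_consistency = 1.0 - (p - (1.0 - p_inv)).abs().mean()
    contradiction_index = ((p > tau) & (p_inv > tau)).float().mean()
    return polar_consistency.item(), contradiction_index.item()
```

In this sketch, `h_pos`/`h_neg` would be hidden states for matched statement pairs extracted from one layer of the model under test, and `h_inverted` the states for the polarity-inverted variants; the control condition described above corresponds to re-running `polarity_scores` with the negation token swapped for a meaningless marker and checking whether consistency drops.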