Pressure Reveals Character: Behavioural Alignment Evaluation at Depth

📅 2026-02-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limitations of existing alignment evaluations for language models, which are predominantly confined to static or single-turn settings and thus fail to capture behavioural failures under real-world deployment pressure. The authors propose the first alignment benchmark designed specifically for multi-turn interactions, comprising 904 human-validated realistic scenarios across six categories that probe model consistency under stress through conflicting instructions, simulated tool use, and progressively escalating adversarial exchanges. Applying LLM-based judges and factor analysis across 24 state-of-the-art models, the study finds that alignment behaves as a unified underlying construct (akin to the general-intelligence "g-factor"): models that score high in one category tend to score high in the others. At the same time, most models, including leading systems, show systematic weaknesses in specific categories. The benchmark and an interactive leaderboard are publicly released.

📝 Abstract
Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories -- Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming -- validated as realistic by human raters. Our scenarios place models under conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct (analogous to the g-factor in cognitive research) with models scoring high on one category tending to score high on others. We publicly release the benchmark and an interactive leaderboard to support ongoing evaluation, with plans to expand scenarios in areas where we observe persistent weaknesses and to add new models as they are released.
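The g-factor claim in the abstract rests on factor analysis of the model-by-category score matrix: if one latent factor explains most of the variance, alignment scores move together across categories. The sketch below illustrates this idea on invented data (the scores, loadings, and noise level are all hypothetical, not the paper's results), using the first eigenvalue of the inter-category correlation matrix as a simple proxy for a one-factor model.

```python
# Hypothetical illustration of the paper's single-factor ("g-factor") finding.
# All numbers below are simulated; none come from the benchmark itself.
import numpy as np

rng = np.random.default_rng(0)
categories = ["Honesty", "Safety", "Non-Manipulation",
              "Robustness", "Corrigibility", "Scheming"]

# Simulate 24 models whose six category scores share one latent alignment level.
g = rng.normal(size=(24, 1))                  # latent general-alignment factor
loadings = rng.uniform(0.6, 0.9, size=(1, 6)) # how strongly each category loads on it
scores = g @ loadings + 0.3 * rng.normal(size=(24, 6))  # plus category-specific noise

# One-factor proxy: share of variance captured by the leading eigenvector
# of the category-by-category correlation matrix.
corr = np.corrcoef(scores, rowvar=False)
eigvals = np.linalg.eigvalsh(corr)[::-1]      # eigenvalues, largest first
explained = eigvals[0] / eigvals.sum()
print(f"first factor explains {explained:.0%} of category-score variance")
```

With correlated simulated scores, the leading factor accounts for the bulk of the variance, which is the signature the paper reports across its six categories; a dedicated routine such as scikit-learn's `FactorAnalysis` would give a proper maximum-likelihood fit.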
Problem

Research questions and friction points this paper is trying to address.

alignment evaluation
language models
behavioural alignment
realistic scenarios
multi-turn interactions
Innovation

Methods, ideas, or system contributions that make the work stand out.

alignment benchmark
multi-turn evaluation
behavioural pressure testing
factor analysis of alignment
realistic scenario validation