Stress-Testing Model Specs Reveals Character Differences among Language Models

📅 2025-10-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses two pervasive problems in the behavioral specifications used to train large language models (LLMs): conflicts between principles and insufficient coverage of nuanced scenarios. We propose an automated stress-testing framework targeting multi-value conflicts. Methodologically, we design a conflict-scenario generator grounded in a fine-grained value taxonomy and combine value-classification scoring with a hybrid qualitative and quantitative analysis to systematically assess how consistently models resolve value tradeoffs. Experiments identify over 70,000 cases of significant behavioral divergence. Our results provide empirical evidence that high behavioral disagreement is an effective predictor of specification deficiencies, that state-of-the-art models exhibit both clear misalignment and over-refusal, and that distinct models display stable yet divergent core-value prioritization patterns. The framework establishes a scalable, interpretable paradigm for LLM alignment evaluation.
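The summary describes a taxonomy-driven conflict-scenario generator. As a minimal sketch of how such generation could work, assuming a toy five-value taxonomy and a single prompt template (both are illustrative placeholders, not the paper's actual taxonomy or prompts):

```python
# Taxonomy-driven conflict-scenario generation, sketched under toy assumptions:
# VALUE_TAXONOMY and PROMPT_TEMPLATE are hypothetical stand-ins for the paper's
# fine-grained taxonomy and generation pipeline.
from itertools import combinations

# Hypothetical fine-grained value taxonomy (the real taxonomy is far larger).
VALUE_TAXONOMY = [
    "user autonomy",
    "harm avoidance",
    "honesty and transparency",
    "privacy protection",
    "helpfulness",
]

PROMPT_TEMPLATE = (
    "Write a realistic user request in which an assistant cannot fully honor "
    "both of these principles at once:\n"
    "  Principle A: {a}\n"
    "  Principle B: {b}\n"
    "The request should make the tradeoff explicit and unavoidable."
)

def generate_scenario_prompts():
    """Yield one scenario-generation prompt per unordered pair of values."""
    for value_a, value_b in combinations(VALUE_TAXONOMY, 2):
        yield (value_a, value_b), PROMPT_TEMPLATE.format(a=value_a, b=value_b)

if __name__ == "__main__":
    for (a, b), prompt in generate_scenario_prompts():
        print(f"--- {a} vs. {b} ---")
        print(prompt, end="\n\n")
```

Each prompt would then be sent to a generator model to produce concrete scenarios; exhaustively pairing taxonomy values is the combinatorial step this sketch illustrates.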

📝 Abstract
Large language models (LLMs) are increasingly trained from AI constitutions and model specifications that establish behavioral guidelines and ethical principles. However, these specifications face critical challenges, including internal conflicts between principles and insufficient coverage of nuanced scenarios. We present a systematic methodology for stress-testing model character specifications, automatically identifying numerous cases of principle contradictions and interpretive ambiguities in current model specs. We stress-test current model specs by generating scenarios that force explicit tradeoffs between competing value-based principles. Using a comprehensive taxonomy, we generate diverse value tradeoff scenarios in which models must choose between pairs of legitimate principles that cannot be simultaneously satisfied. We evaluate responses from twelve frontier LLMs across major providers (Anthropic, OpenAI, Google, xAI) and measure behavioral disagreement through value classification scores. Among these scenarios, we identify over 70,000 cases exhibiting significant behavioral divergence. Empirically, we show that this high divergence in model behavior strongly predicts underlying problems in model specifications. Through qualitative analysis, we document numerous issues in current model specs, such as direct contradictions between principles and interpretive ambiguities in several of them. Our generated dataset also reveals both clear misalignment cases and false-positive refusals across all of the frontier models we study. Lastly, we characterize these models' value prioritization patterns and how they differ.
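The abstract measures behavioral disagreement through value classification scores. A minimal sketch of one plausible scoring-and-flagging step, assuming each response has already been mapped to a scalar in [0, 1] indicating which principle it favors; the scale, threshold, spread statistic, and sample data below are all illustrative assumptions, not the paper's exact method:

```python
# Flagging high-disagreement scenarios from per-model value-classification
# scores. Score convention (assumed): 0 = fully favors principle A,
# 1 = fully favors principle B.
from statistics import pstdev

# Hypothetical scores for a single scenario: model name -> score in [0, 1].
scenario_scores = {
    "model_a": 0.10,  # strongly favors principle A
    "model_b": 0.90,  # strongly favors principle B
    "model_c": 0.20,
    "model_d": 0.85,
}

DISAGREEMENT_THRESHOLD = 0.3  # illustrative cutoff, not the paper's value

def disagreement(scores):
    """Population standard deviation of per-model scores: one simple spread
    measure; the paper may use a different disagreement statistic."""
    return pstdev(scores.values())

if __name__ == "__main__":
    d = disagreement(scenario_scores)
    print(f"disagreement={d:.3f}, flag_for_review={d > DISAGREEMENT_THRESHOLD}")
```

Scenarios whose spread exceeds the threshold would be flagged for qualitative review, analogous in spirit to the high-divergence cases the paper surfaces.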
Problem

Research questions and friction points this paper is trying to address.

Revealing behavioral contradictions by stress-testing model specifications
Identifying principle conflicts and ambiguities in AI guidelines
Measuring value prioritization differences across frontier language models (see the sketch after this list)
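To illustrate the last point, here is a minimal sketch of deriving per-model value-priority rankings from pairwise tradeoff outcomes; the record format and the win-rate statistic are illustrative assumptions rather than the paper's exact aggregation:

```python
# Deriving per-model value-prioritization patterns from forced pairwise
# tradeoffs. OUTCOMES is hypothetical sample data: each record says which
# value a model chose and which it sacrificed in one scenario.
from collections import defaultdict

OUTCOMES = [
    ("model_a", "harm avoidance", "helpfulness"),
    ("model_a", "harm avoidance", "user autonomy"),
    ("model_a", "privacy protection", "helpfulness"),
    ("model_b", "helpfulness", "harm avoidance"),
    ("model_b", "user autonomy", "privacy protection"),
    ("model_b", "helpfulness", "privacy protection"),
]

def priority_rankings(outcomes):
    """Rank values per model by win rate in forced pairwise tradeoffs."""
    wins = defaultdict(lambda: defaultdict(int))
    appearances = defaultdict(lambda: defaultdict(int))
    for model, chosen, sacrificed in outcomes:
        wins[model][chosen] += 1
        for value in (chosen, sacrificed):
            appearances[model][value] += 1
    rankings = {}
    for model, counts in appearances.items():
        rates = {value: wins[model][value] / n for value, n in counts.items()}
        rankings[model] = sorted(rates, key=rates.get, reverse=True)
    return rankings

if __name__ == "__main__":
    for model, ranking in priority_rankings(OUTCOMES).items():
        print(model, "prioritizes:", " > ".join(ranking))
```

Stable orderings across many scenarios would correspond to the per-model prioritization patterns the paper reports.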
Innovation

Methods, ideas, or system contributions that make the work stand out.

Stress-tests model specs with value tradeoff scenarios
Generates diverse scenarios using a comprehensive taxonomy
Measures behavioral divergence through value classification scores