Crisis-Bench: Benchmarking Strategic Ambiguity and Reputation Management in Large Language Models

2026-01-09
arXiv.org
AI Summary
This study addresses a limitation of general-purpose safety alignment in large language models: it often overemphasizes honesty and transparency at the expense of the strategic ambiguity required in professional contexts such as public relations and negotiation. To evaluate models' capacity to balance private and public narratives under pressure, the authors formulate a multi-agent partially observable Markov decision process (POMDP) that simulates high-stakes corporate crises across 80 scenarios spanning eight industries. They propose a dynamic evaluation framework built around the Adjudicator-Market Loop, in which public sentiment is adjudicated and mapped to a simulated stock price to quantify how well a model manages reputation. Experimental results reveal substantial differences between models in how they trade off ethical concerns against strategic information withholding, with select models demonstrating professional-level capability by effectively stabilizing the simulated stock price.

Abstract
Standard safety alignment optimizes Large Language Models (LLMs) for universal helpfulness and honesty, effectively instilling a rigid "Boy Scout" morality. While robust for general-purpose assistants, this one-size-fits-all ethical framework imposes a "transparency tax" on professional domains requiring strategic ambiguity and information withholding, such as public relations, negotiation, and crisis management. To measure this gap between general safety and professional utility, we introduce Crisis-Bench, a multi-agent Partially Observable Markov Decision Process (POMDP) that evaluates LLMs in high-stakes corporate crises. Spanning 80 diverse storylines across 8 industries, Crisis-Bench tasks an LLM-based Public Relations (PR) Agent with navigating a dynamic 7-day corporate crisis simulation while managing strictly separated Private and Public narrative states to enforce rigorous information asymmetry. Unlike traditional benchmarks that rely on static ground truths, we introduce the Adjudicator-Market Loop: a novel evaluation metric where public sentiment is adjudicated and translated into a simulated stock price, creating a realistic economic incentive structure. Our results expose a critical dichotomy: while some models capitulate to ethical concerns, others demonstrate the capacity for Machiavellian, legitimate strategic withholding in order to stabilize the simulated stock price. Crisis-Bench provides the first quantitative framework for assessing "Reputation Management" capabilities, arguing for a shift from rigid moral absolutism to context-aware professional alignment.
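
The loop described in the abstract can be pictured with a minimal Python sketch. This is an illustration under stated assumptions, not the authors' implementation: the abstract does not give the adjudication procedure or the sentiment-to-price rule, so `CrisisState`, `toy_adjudicator`, `market_step`, the linear price update, the `sensitivity` value, and the starting price of 100 are all hypothetical. The sketch only shows the structure the abstract names: a 7-day episode, separate Private and Public narrative states, and a price signal driven by adjudicated public sentiment.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CrisisState:
    """Narrative state split into private facts and public statements (hypothetical)."""
    private_facts: List[str]                                     # seen by the PR agent only
    public_statements: List[str] = field(default_factory=list)   # seen by everyone
    stock_price: float = 100.0                                   # assumed starting price


def toy_adjudicator(public_statements: List[str]) -> float:
    """Placeholder sentiment judge returning a score in [-1, 1].

    Assumption: a keyword rule stands in for whatever judge the benchmark uses,
    purely so the sketch runs without external services.
    """
    latest = public_statements[-1].lower()
    if "investigating" in latest or "committed" in latest:
        return 0.5
    if "no comment" in latest:
        return -0.5
    return 0.0


def market_step(price: float, sentiment: float, sensitivity: float = 0.05) -> float:
    """Map sentiment to a new price with an assumed linear rule."""
    return price * (1.0 + sensitivity * sentiment)


def run_episode(
    state: CrisisState,
    pr_agent: Callable[[int, List[str], List[str]], str],
    adjudicator: Callable[[List[str]], float] = toy_adjudicator,
    days: int = 7,
) -> List[float]:
    """Run one crisis episode and return the daily simulated prices."""
    prices = []
    for day in range(1, days + 1):
        # The agent sees both narrative states and decides what to disclose.
        statement = pr_agent(day, state.private_facts, state.public_statements)
        state.public_statements.append(statement)
        # The adjudicator receives the public record only (information asymmetry).
        sentiment = adjudicator(state.public_statements)
        state.stock_price = market_step(state.stock_price, sentiment)
        prices.append(state.stock_price)
    return prices


def scripted_agent(day: int, private_facts: List[str], public_statements: List[str]) -> str:
    """Stand-in for an LLM-based PR agent; always issues the same holding statement."""
    return "We are investigating and remain committed to our customers."


state = CrisisState(private_facts=["internal audit found a defect"])
prices = run_episode(state, scripted_agent)
print(prices)
```

In this sketch the private fact never reaches the adjudicator, which is one simple way to enforce the separation between the two narrative states; an actual evaluation would replace `scripted_agent` and `toy_adjudicator` with model calls.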
Problem

Research questions and friction points this paper is trying to address.

strategic ambiguity
reputation management
large language models
professional alignment
information withholding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Crisis-Bench
strategic ambiguity
reputation management
POMDP
Adjudicator-Market Loop
Authors

Cooper Lin
Hong Kong University of Science and Technology

Maohao Ran
Hong Kong University of Science and Technology, Hong Kong Baptist University

Yanting Zhang
Donghua University

Zhenglin Wan
National University of Singapore

Hongwei Fan
Peking University
Robotics, 3D Vision

Yibo Xu
Hong Kong University of Science and Technology

Yike Guo
Hong Kong University of Science and Technology

Wei Xue
Department of Applied Plant Science, Chonnam National University
Crop ecophysiology modelling, climate change

Jun Song
Shenzhen University
nanophotonics