🤖 AI Summary
This work addresses the susceptibility of large language models (LLMs) to semantic framing effects in high-stakes decision-making, where identical facts yield inconsistent decisions solely because of differences in phrasing. Drawing on behavioral psychology, the study is the first to decompose the framing effect into three controllable dimensions (value-tinted narration, temporal slice, and narrative vividness) and introduces Fragile, a large-scale benchmark for evaluating robustness to framing. To mitigate this vulnerability, the authors propose Valign, a representation-level intervention that anchors decisions to a stable value prior, steers hidden states toward the model's value-consistent direction, and projects temporal-vividness-sensitive directions out of the hidden states. Experiments show that Valign consistently reduces frame-induced decision flips, whereas existing prompt-level and activation-level methods fail to suppress framing sensitivity and can amplify it. The work thereby links value alignment with robustness to framing variations.
📝 Abstract
Large Language Models (LLMs) are increasingly deployed in high-stakes decision-making settings such as legal reasoning, where consistency under factually equivalent inputs is critical. However, we find that inputs that preserve the facts but frame them differently can significantly destabilize LLM decisions. To systematically investigate this problem, we introduce Fragile, a large-scale benchmark that isolates fact-preserving semantic framing along three controlled dimensions: value-tinted narration, temporal slice, and narrative vividness. Our experiments reveal that LLMs are highly susceptible to framing, with an average decision flip rate of 28.6%. We also find that simple prompt-level and activation-level interventions from prior work not only fail to suppress framing sensitivity but actively amplify it. We therefore propose Valign, a representation-level method that explicitly targets these framing dimensions by anchoring decisions to a stable value prior, steering hidden states toward the model's value-consistent direction, and projecting out temporal-vividness-sensitive directions from the model's hidden states. Valign consistently reduces framing-induced decision flips, demonstrating that robust mitigation requires directly targeting the internal pathways through which framing operates.
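To make two of the quantities above concrete, here is a minimal Python sketch of a decision flip rate and of projecting a single direction out of a hidden-state vector. It is illustrative only and is not the authors' implementation: the function names `flip_rate` and `project_out`, the toy decision labels, and the assumption that a flip is counted as any disagreement between a baseline framing and one reframed version of the same case are all hypothetical, since the abstract does not specify the exact metric or the form of the sensitive subspace.

```python
from typing import List, Sequence


def flip_rate(baseline: Sequence[str], reframed: Sequence[str]) -> float:
    """Fraction of cases whose decision differs between two framings.

    Assumption: one baseline decision and one reframed decision per case.
    """
    if len(baseline) != len(reframed):
        raise ValueError("decision lists must have the same length")
    if not baseline:
        raise ValueError("need at least one case")
    flips = sum(b != r for b, r in zip(baseline, reframed))
    return flips / len(baseline)


def project_out(hidden: Sequence[float], direction: Sequence[float]) -> List[float]:
    """Remove the component of `hidden` that lies along `direction`.

    Computes h - ((h . d) / (d . d)) * d, i.e. orthogonal projection onto
    the complement of a single (hypothetical) framing-sensitive direction.
    """
    norm_sq = sum(d * d for d in direction)
    if norm_sq == 0:
        raise ValueError("direction must be non-zero")
    scale = sum(h * d for h, d in zip(hidden, direction)) / norm_sq
    return [h - scale * d for h, d in zip(hidden, direction)]


# Toy decisions for four cases under two framings (made-up labels).
baseline = ["liable", "not liable", "liable", "liable"]
reframed = ["liable", "liable", "liable", "not liable"]
print(flip_rate(baseline, reframed))  # 2 of 4 decisions change -> 0.5

# Toy 2-d hidden state with the first axis treated as the sensitive direction.
print(project_out([3.0, 4.0], [1.0, 0.0]))  # -> [0.0, 4.0]
```

A subspace spanned by several directions would need an orthonormal basis (for example from a QR decomposition) before applying the same subtraction for each basis vector; the single-direction case is shown here only to keep the sketch short.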