Moral Preferences of LLMs Under Directed Contextual Influence

📅 2026-02-26

📈 Citations: 0

✨ Influential: 0

career value

171K/year

🤖 AI Summary

This study addresses the limitation of current moral evaluations of large language models, which predominantly rely on context-free prompts and overlook how directive contextual cues—such as user requests or social norms—influence ethical decision-making in real-world applications. The authors propose a novel evaluation framework that systematically measures model decision shifts under opposing yet content-matched contextual prompts within trolley-problem-inspired moral triage scenarios. By innovatively incorporating a direction-flipped context manipulation mechanism alongside controlled prompt design, few-shot example engineering, and behavioral analysis, the work uncovers a systematic, steerability-induced asymmetry beneath apparent model neutrality. Key findings reveal that contextual cues substantially alter model decisions; baseline preferences fail to predict steerability; specific contexts can induce reversed biases; and while reasoning reduces average sensitivity, it amplifies the influence of biased examples.

Technology Category

Application Category

📝 Abstract

Moral benchmarks for LLMs typically use context-free prompts, implicitly assuming stable preferences. In deployment, however, prompts routinely include contextual signals such as user requests, cues on social norms, etc. that may steer decisions. We study how directed contextual influences reshape decisions in trolley-problem-style moral triage settings. We introduce a pilot evaluation harness for directed contextual influence in trolley-problem-style moral triage: for each demographic factor, we apply matched, direction-flipped contextual influences that differ only in which group they favor, enabling systematic measurement of directional response. We find that: (i) contextual influences often significantly shift decisions, even when only superficially relevant; (ii) baseline preferences are a poor predictor of directional steerability, as models can appear baseline-neutral yet exhibit systematic steerability asymmetry under influence; (iii) influences can backfire: models may explicitly claim neutrality or discount the contextual cue, yet their choices still shift, sometimes in the opposite direction; and (iv) reasoning reduces average sensitivity, but amplifies the effect of biased few-shot examples. Our findings motivate extending moral evaluations with controlled, direction-flipped context manipulations to better characterize model behavior.

Problem

Research questions and friction points this paper is trying to address.

moral preferences

large language models

contextual influence

trolley problem

moral evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

directed contextual influence

moral triage

direction-flipped prompts