Understanding Moral Reasoning Trajectories in Large Language Models: Toward Probing-Based Explainability

📅 2026-03-16
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses the lack of interpretability in how large language models invoke ethical frameworks during morally sensitive decision-making. The authors introduce the concept of "moral reasoning trajectories" and propose the MRC (Moral Representation Consistency) metric to systematically analyze dynamic ethical framework switching across multi-step reasoning. Using linear probing for localization, lightweight activation steering, KL divergence evaluation, and human annotation validation, they find that 55.4–57.7% of reasoning steps involve framework shifts, with unstable trajectories exhibiting greater vulnerability to adversarial attacks. Probing-based interventions significantly reduce KL divergence, and MRC scores correlate strongly with LLM coherence ratings (r = 0.715), with the underlying framework attributions validated by human annotators, revealing an intrinsic link among ethical representations, behavioral stability, and human moral assessment.
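The KL-divergence evaluation mentioned in the summary can be sketched minimally. The function below compares a probe's predicted distribution over ethical frameworks against a uniform prior baseline; the four-framework setup and the specific distributions are hypothetical illustrations, not the paper's implementation.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two discrete distributions, with smoothing."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

# Hypothetical distributions over four frameworks
# (e.g. deontology, consequentialism, virtue ethics, contractualism).
probe_pred = [0.55, 0.25, 0.15, 0.05]   # probe output at one reasoning step
prior      = [0.25, 0.25, 0.25, 0.25]   # training-set prior baseline
print(kl_divergence(probe_pred, prior))
```

A lower value against the prior baseline would indicate that the probe's prediction carries framework-specific information beyond base rates.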

๐Ÿ“ Abstract
Large language models (LLMs) increasingly participate in morally sensitive decision-making, yet how they organize ethical frameworks across reasoning steps remains underexplored. We introduce "moral reasoning trajectories", sequences of ethical framework invocations across intermediate reasoning steps, and analyze their dynamics across six models and three benchmarks. We find that moral reasoning involves systematic multi-framework deliberation: 55.4–57.7% of consecutive steps involve framework switches, and only 16.4–17.8% of trajectories remain framework-consistent. Unstable trajectories are 1.29× more susceptible to persuasive attacks (p = 0.015). At the representation level, linear probes localize framework-specific encoding to model-specific layers (layer 63/81 for Llama-3.3-70B; layer 17/81 for Qwen2.5-72B), achieving 13.8–22.6% lower KL divergence than the training-set prior baseline. Lightweight activation steering modulates framework integration patterns (6.7–8.9% drift reduction) and amplifies the stability–accuracy relationship. We further propose a Moral Representation Consistency (MRC) metric that correlates strongly (r = 0.715, p < 0.0001) with LLM coherence ratings, whose underlying framework attributions are validated by human annotators (mean cosine similarity = 0.859).
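A linear probe of the kind the abstract describes can be sketched on synthetic activations. Everything here (the hidden size, number of frameworks, signal strength, and the least-squares fit) is an assumption for illustration, not the authors' actual setup:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 300, 64, 4  # reasoning steps, hidden size, number of frameworks

# Synthetic hidden-state activations with a framework-specific signal.
labels = rng.integers(0, k, size=n)
centers = rng.normal(size=(k, d))          # one latent direction per framework
X = centers[labels] + rng.normal(scale=2.0, size=(n, d))

# One-hot targets; fit a linear probe by ordinary least squares.
Y = np.eye(k)[labels]
W, *_ = np.linalg.lstsq(X, Y, rcond=None)
acc = float((np.argmax(X @ W, axis=1) == labels).mean())
print(f"probe accuracy: {acc:.2f}")
```

If a probe trained this way beats the chance level (0.25 here) at some layer, that layer plausibly encodes framework identity, which is the localization logic the paper applies per layer.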
Problem

Research questions and friction points this paper is trying to address.

moral reasoning
large language models
ethical frameworks
reasoning trajectories
explainability
Innovation

Methods, ideas, or system contributions that make the work stand out.

moral reasoning trajectories
linear probing
activation steering
ethical framework
Moral Representation Consistency
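The activation steering listed above is often implemented by adding a difference-of-means direction to hidden states. The sketch below uses synthetic vectors and illustrates the general technique under assumed names and dimensions, not this paper's specific method:

```python
import numpy as np

rng = np.random.default_rng(1)
d = 64  # hidden size (assumed)

# Hypothetical mean activations for two framework classes.
mu_deont = rng.normal(size=d)
mu_conseq = rng.normal(size=d)

# Difference-of-means steering vector, normalized to unit length.
steer = mu_conseq - mu_deont
steer /= np.linalg.norm(steer)

def steered(h, alpha=2.0):
    """Shift a hidden state along the steering direction with strength alpha."""
    return h + alpha * steer

h = mu_deont + rng.normal(scale=0.5, size=d)
before = float(h @ steer)
after = float(steered(h) @ steer)
print(before, "->", after)  # projection onto the steering direction rises by alpha
```

Because `steer` has unit norm, the projection increases by exactly `alpha`; in practice the intervention strength is tuned so the shift changes framework usage without degrading fluency.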