Where Paths Split: Localized, Calibrated Control of Moral Reasoning in Large Language Models

📅 2026-05-05

🤖 AI Summary

Large language models often exhibit inconsistent ethical preferences in moral decision-making and are hard to align with a specific ethical framework without degrading general capabilities. This work proposes a fine-grained, inference-time intervention. Convergent-Divergent Routing locates the branch points inside Transformer blocks where ethical-framework-related pathways first converge and then diverge, and gates the non-target branches there. Common Spatial Patterns, adapted to the residual stream, extract a pair of directions per branch-point layer that discriminate utilitarian from deontological reasoning. Dual Logit Calibration then applies a closed-form, minimum-norm update within that two-dimensional subspace so the projections match user-specified preference weights. On real-life moral dilemmas, the approach outperforms recent baselines: it calibrates the model to the requested ethical framework, such as utilitarianism or deontology, while largely preserving general-purpose competence and providing an interpretable mechanism.
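
To make the Common Spatial Patterns step concrete, here is a minimal sketch of textbook two-class CSP applied to activation vectors. It is not the paper's adaptation: the function name `csp_directions`, the use of mean-centered covariances, and the synthetic data are all assumptions made for illustration. Given activations from two classes (say, utilitarian and deontological prompts), it returns one direction per class along which that class carries the largest share of the combined variance.

```python
import numpy as np

def csp_directions(X_a, X_b):
    """Textbook two-class CSP (illustrative, not the paper's variant).

    X_a, X_b: arrays of shape (n_samples, dim), one row per activation.
    Returns (w_a, w_b): the direction favoring class a and the one
    favoring class b, in terms of variance share.
    """
    C_a = np.cov(X_a, rowvar=False)
    C_b = np.cov(X_b, rowvar=False)

    # Whiten the composite covariance C_a + C_b.
    evals, evecs = np.linalg.eigh(C_a + C_b)
    keep = evals > 1e-10 * evals.max()          # drop near-null directions
    P = (evecs[:, keep] / np.sqrt(evals[keep])).T

    # After whitening, S_a + S_b = I, so both share eigenvectors and the
    # eigenvalues of S_a are class a's share of variance, in [0, 1].
    S_a = P @ C_a @ P.T
    _, B = np.linalg.eigh(S_a)                  # eigenvalues ascending
    W = B.T @ P                                 # rows are CSP filters
    return W[-1], W[0]

# Hypothetical data: class a varies most along axis 0, class b along axis 1.
rng = np.random.default_rng(0)
X_a = rng.normal(size=(500, 3)) * np.array([3.0, 1.0, 1.0])
X_b = rng.normal(size=(500, 3)) * np.array([1.0, 3.0, 1.0])
w_a, w_b = csp_directions(X_a, X_b)
```

In this toy setup `w_a` points mostly along axis 0 and `w_b` mostly along axis 1, which is the kind of discriminative pair the summary describes for each branch-point layer.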

📝 Abstract

Large language models often display heterogeneous moral preferences across settings. We study inference-time steering toward a desired ethical framework while preserving general competence. We present Convergent-Divergent Routing, which traces and edits minimal branch points inside transformer blocks where ethical-framework-related pathways first converge and then diverge. Gating non-target branches at these loci blocks the downstream propagation while leaving upstream computations intact. We find that this intervention alone increases targeted ethical-framework reasoning. To achieve fine-grained control, we adapt Common Spatial Patterns to the residual stream and extract, for each branch-point layer, a pair of directions that discriminate between utilitarian and deontological frameworks. We then introduce Dual Logit Calibration, a closed-form, minimum-$\ell_2$-norm update that moves the residual within this two-dimensional subspace so the resulting directional projections align with user-specified preference weights. Experiments on real-life moral dilemmas show that our method reliably achieves preference calibration and largely preserves general capabilities, outperforming recent baselines while providing an interpretable mechanism.
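
The closed-form, minimum-norm update can be illustrated with a small sketch. It assumes, hypothetically, that the preference weights have already been converted into two target projection values, one per direction; how the paper maps weights to targets is not stated here, so `target_u` and `target_d` are placeholders, as are the function names. Under that assumption, the smallest ℓ2 change to a residual vector `h` that hits both targets lies in the span of the two directions, and its coefficients come from solving a 2x2 linear system on their Gram matrix.

```python
def dot(a, b):
    """Inner product of two equal-length sequences of floats."""
    return sum(x * y for x, y in zip(a, b))

def min_norm_calibrate(h, u, d, target_u, target_d):
    """Return h + delta, where delta is the smallest-L2-norm vector such that
    dot(u, h + delta) == target_u and dot(d, h + delta) == target_d.

    Illustrative only: u and d stand in for the two extracted directions,
    and the targets stand in for user-specified preference weights.
    """
    # How far the current projections are from the targets.
    r_u = target_u - dot(u, h)
    r_d = target_d - dot(d, h)

    # 2x2 Gram matrix of the directions (they need not be orthogonal).
    g_uu, g_ud, g_dd = dot(u, u), dot(u, d), dot(d, d)
    det = g_uu * g_dd - g_ud * g_ud
    if abs(det) < 1e-12:
        raise ValueError("directions are (nearly) collinear")

    # Solve G @ [a, b] = [r_u, r_d]; then delta = a*u + b*d.
    a = (g_dd * r_u - g_ud * r_d) / det
    b = (g_uu * r_d - g_ud * r_u) / det
    return [hi + a * ui + b * di for hi, ui, di in zip(h, u, d)]

# Usage with made-up numbers: only the span of u and d is modified.
h = [1.0, 2.0, 3.0]
u = [1.0, 0.0, 0.0]
d = [1.0, 1.0, 0.0]
h_new = min_norm_calibrate(h, u, d, target_u=0.7, target_d=0.3)
```

Because the update stays inside the two-dimensional subspace, every component of the residual orthogonal to both directions is left unchanged, which is consistent with the stated goal of preserving general capabilities.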

Problem

Research questions and friction points this paper is trying to address.

moral reasoning

large language models

ethical frameworks

preference calibration

inference-time control

Innovation

Methods, ideas, or system contributions that make the work stand out.

Convergent-Divergent Routing

Dual Logit Calibration

moral reasoning control