🤖 AI Summary
Large language models often exhibit inconsistent ethical preferences in moral decision-making, and it is difficult to align them with a specific ethical framework while preserving general capabilities. This work proposes a fine-grained intervention method, Convergent-Divergent Routing, which identifies the critical branching points in Transformer-based models where ethical reasoning paths first converge and subsequently diverge. It uses Common Spatial Patterns to extract discriminative directional features at those points and introduces Dual Logit Calibration, a closed-form, minimum-norm update for preference calibration. Evaluated on real-world moral dilemma tasks, the approach outperforms existing baselines, steering models to adhere to user-specified ethical principles such as utilitarianism or deontology while largely retaining their general-purpose competence. The method is interpretable and computationally efficient, providing a principled pathway for controllable ethical alignment.
📝 Abstract
Large language models often display heterogeneous moral preferences across settings. We study inference-time steering toward a desired ethical framework while preserving general competence. We present Convergent-Divergent Routing, which traces and edits minimal branch points inside transformer blocks where ethical-framework-related pathways first converge and then diverge. Gating non-target branches at these loci blocks the downstream propagation while leaving upstream computations intact. We find that this intervention alone increases targeted ethical-framework reasoning. To achieve fine-grained control, we adapt Common Spatial Patterns to the residual stream and extract, for each branch-point layer, a pair of directions that discriminate between utilitarian and deontological frameworks. We then introduce Dual Logit Calibration, a closed-form, minimum-$\ell_2$-norm update that moves the residual within this two-dimensional subspace so the resulting directional projections align with user-specified preference weights. Experiments on real-life moral dilemmas show that our method reliably achieves preference calibration and largely preserves general capabilities, outperforming recent baselines while providing an interpretable mechanism.
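To make the idea of a closed-form, minimum-$\ell_2$-norm update concrete, the following is a minimal sketch and not the paper's implementation. It assumes the two directions `w1` and `w2` are already known and linearly independent, and that the user-specified preference weights have already been mapped to target projections `targets`; how that mapping is done is not stated in the abstract. The function name `min_norm_update` and all variable names are hypothetical. Under these assumptions, the smallest change to a residual vector `h` that hits both target projections lies in the span of the two directions and follows from solving a 2x2 linear system.

```python
def dot(u, v):
    """Euclidean inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))


def min_norm_update(h, w1, w2, targets):
    """Return the minimum-l2-norm delta with dot(w_i, h + delta) == targets[i].

    Hypothetical helper: h is a residual vector, w1 and w2 are the two
    discriminative directions, targets are the desired projections.
    """
    # Gap between the desired and the current projections.
    r1 = targets[0] - dot(w1, h)
    r2 = targets[1] - dot(w2, h)

    # 2x2 Gram matrix G of the two directions.
    g11, g12, g22 = dot(w1, w1), dot(w1, w2), dot(w2, w2)
    det = g11 * g22 - g12 * g12
    if abs(det) < 1e-12:
        raise ValueError("directions are (nearly) collinear")

    # Solve G a = r in closed form using the 2x2 inverse.
    a1 = (g22 * r1 - g12 * r2) / det
    a2 = (g11 * r2 - g12 * r1) / det

    # The minimum-norm solution is a combination of w1 and w2 only,
    # so components of h outside their span are left unchanged.
    return [a1 * x + a2 * y for x, y in zip(w1, w2)]
```

For example, calling `min_norm_update([1.0, 2.0, 3.0], [1.0, 0.0, 0.0], [1.0, 1.0, 0.0], [0.5, -1.0])` returns an update whose third component is zero, because neither direction has a component along that axis. Adding the update to `h` yields projections of 0.5 and -1.0 onto the two directions.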