🤖 AI Summary
Large language models frequently violate hierarchical priority—preferring socially salient cues (e.g., authority signals) over system-level instructions when conflicts arise—thereby exhibiting fragile adherence to system prompts. Method: We identify that instruction–social cue conflicts induce separable subspaces in the model’s representation space: while the model reliably detects such conflicts, its decision-making remains stably biased toward social cues. Using linear probing, direct logit attribution (DLA), and vector-based steering interventions, we localize early neural encoding of conflict decisions and quantify the asymmetry between conflict detection and resolution. Contribution/Results: We propose the first lightweight, hierarchy-aware alignment method that operates without parameter modification. It significantly improves system prompt adherence by selectively modulating conflict-sensitive representations, offering a novel mechanistic pathway for controllable alignment.
📝 Abstract
Large language models should follow hierarchical instructions where system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that, despite using social cues, the vectors surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods.