Who is In Charge? Dissecting Role Conflicts in Instruction Following

📅 2025-09-23

🏛️ arXiv.org

📈 Citations: 0

✨ Influential: 0

career value

223K/year

🤖 AI Summary

Large language models frequently violate hierarchical priority—preferring socially salient cues (e.g., authority signals) over system-level instructions when conflicts arise—thereby exhibiting fragile adherence to system prompts. Method: We identify that instruction–social cue conflicts induce separable subspaces in the model’s representation space: while the model reliably detects such conflicts, its decision-making remains stably biased toward social cues. Using linear probing, direct logit attribution (DLA), and vector-based steering interventions, we localize early neural encoding of conflict decisions and quantify the asymmetry between conflict detection and resolution. Contribution/Results: We propose the first lightweight, hierarchy-aware alignment method that operates without parameter modification. It significantly improves system prompt adherence by selectively modulating conflict-sensitive representations, offering a novel mechanistic pathway for controllable alignment.

Technology Category

Application Category

📝 Abstract

Large language models should follow hierarchical instructions where system prompts override user inputs, yet recent work shows they often ignore this rule while strongly obeying social cues such as authority or consensus. We extend these behavioral findings with mechanistic interpretations on a large-scale dataset. Linear probing shows conflict-decision signals are encoded early, with system-user and social conflicts forming distinct subspaces. Direct Logit Attribution reveals stronger internal conflict detection in system-user cases but consistent resolution only for social cues. Steering experiments show that, despite using social cues, the vectors surprisingly amplify instruction following in a role-agnostic way. Together, these results explain fragile system obedience and underscore the need for lightweight hierarchy-sensitive alignment methods.

Problem

Research questions and friction points this paper is trying to address.

Examines role conflicts in instruction following by large language models

Mechanistically analyzes system-user versus social cue conflict resolution

Proposes lightweight hierarchy-sensitive alignment methods for improved obedience

Innovation

Methods, ideas, or system contributions that make the work stand out.

Linear probing identifies early conflict detection signals

Direct Logit Attribution reveals inconsistent conflict resolution mechanisms

Steering experiments amplify instruction following role-agnostically

🔎 Similar Papers

No similar papers found.