🤖 AI Summary
This work addresses the loss of task utility and behavioral consistency that large language models often show when they receive instructions from sources with different authority levels (system policies, user requests, tool outputs, and retrieved context) that conflict in benign, often implicit, ways. The authors propose Neuro-Symbolic Hierarchical Alignment (NSHA). At inference time, NSHA formulates instruction resolution as a constraint satisfaction problem under hierarchical priorities and uses a solver to derive a maximally consistent set of applicable instructions. At training time, it distills the solver’s decisions into the model parameters using automatically constructed supervision. Evaluated on rule following, task execution, tool use, and safety in both single-turn and multi-turn interactions, NSHA significantly improves performance under such conflicts while maintaining competitive utility in reference settings.
📝 Abstract
Large language models increasingly operate under multiple instructions from heterogeneous sources with different authority levels, including system policies, user requests, tool outputs, and retrieved context. While prior work on instruction hierarchy highlights the importance of respecting instruction priorities, it mainly focuses on adversarial attacks and overlooks the benign but common instruction conflicts that arise in real-world applications. In such settings, models must not only avoid security violations but also preserve task utility and behavioral consistency when instructions partially or implicitly conflict. We propose Neuro-Symbolic Hierarchical Alignment (NSHA) for hierarchical instruction-following by explicitly modeling and enforcing instruction priorities. At inference time, we introduce solver-guided reasoning that formulates instruction resolution as a constraint satisfaction problem, enabling the model to derive a maximally consistent set of applicable instructions under hierarchical constraints. At training time, NSHA distills solver-based decisions into model parameters using automatically constructed supervision. We evaluate our approach on rule following, task execution, tool use, and safety, covering both single-turn and multi-turn interactions, and show that NSHA significantly improves performance under such conflicts while maintaining competitive utility in reference settings.
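To make the idea of deriving a consistent instruction set under priorities more concrete, the following is a minimal Python sketch. It is not the authors' solver: it swaps in a simple greedy pass in place of a real constraint solver, and the `AUTHORITY` ranking, the `Instruction` class, the slot/value encoding of constraints, and the `resolve` function are all hypothetical names introduced here for illustration only.

```python
from dataclasses import dataclass

# Hypothetical authority ranking (lower number = higher authority).
# The abstract names these sources but does not specify this ordering.
AUTHORITY = {"system": 0, "user": 1, "tool": 2, "context": 3}


@dataclass(frozen=True)
class Instruction:
    source: str      # one of the keys in AUTHORITY
    text: str        # the instruction as written
    requires: tuple  # (slot, value) pairs this instruction demands (assumed encoding)


def conflicts(a: Instruction, b: Instruction) -> bool:
    """Two instructions conflict if they demand different values for the same slot."""
    slots_a = dict(a.requires)
    return any(slot in slots_a and slots_a[slot] != value
               for slot, value in b.requires)


def resolve(instructions):
    """Greedy stand-in for a solver: visit instructions from highest to lowest
    authority and keep each one that is consistent with everything kept so far.
    Ties within one authority level are broken by input order (sorted is stable)."""
    kept, dropped = [], []
    for ins in sorted(instructions, key=lambda i: AUTHORITY[i.source]):
        if any(conflicts(ins, k) for k in kept):
            dropped.append(ins)
        else:
            kept.append(ins)
    return kept, dropped


instructions = [
    Instruction("user", "Answer in French", (("language", "fr"),)),
    Instruction("system", "Always answer in English", (("language", "en"),)),
    Instruction("user", "Use bullet points", (("format", "bullets"),)),
    Instruction("tool", "Return raw JSON", (("format", "json"),)),
]

kept, dropped = resolve(instructions)
for ins in kept:
    print("keep:", ins.source, "-", ins.text)
for ins in dropped:
    print("drop:", ins.source, "-", ins.text)
```

In this toy example the system-level language rule overrides the user's conflicting language request, while the user's non-conflicting formatting request is still honored, which is the kind of benign, partial conflict the abstract describes. A real solver-based formulation could presumably express richer constraints than pairwise slot clashes; the greedy pass only guarantees that no dropped instruction could be added back without a conflict.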