We Think, Therefore We Align LLMs to Helpful, Harmless and Honest Before They Go Wrong

πŸ“… 2025-09-26
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Addressing the dual challenges of representational forgetting and output inconsistency in multi-objective alignment of large language models (LLMs) along helpfulness, harmlessness, and honesty (HHH), this paper proposes Adaptive Multi-Branch Steering (AMBS). AMBS is a two-stage 1-to-N framework: in Stage I, post-attention hidden states of the Transformer layer are computed once to form a shared representation; in Stage II, this representation is cloned into parallel branches and steered via a policy-reference mechanism, enabling objective-specific control while enforcing cross-objective consistency. This design balances objective independence with unified inference, mitigating both catastrophic forgetting and inference fragmentation. Experiments on Alpaca, BeaverTails, and TruthfulQA show that AMBS improves HHH alignment across multiple 7B LLM backbones; on DeepSeek-7B it improves average alignment scores by 32.4% and reduces unsafe outputs by 11.0% relative to a naive 1-to-N baseline, while remaining competitive with state-of-the-art methods.

πŸ“ Abstract
Alignment of Large Language Models (LLMs) along multiple objectives (helpfulness, harmlessness, and honesty, or HHH) is critical for safe and reliable deployment. Prior work has used steering vectors (small control signals injected into hidden states) to guide LLM outputs, typically via one-to-one (1-to-1) Transformer decoders. In this setting, optimizing a single alignment objective can inadvertently overwrite representations learned for other objectives, leading to catastrophic forgetting. More recent approaches extend steering vectors via one-to-many (1-to-N) Transformer decoders. While this alleviates catastrophic forgetting, naive multi-branch designs optimize each objective independently, which can cause inference fragmentation: outputs across HHH objectives may become inconsistent. We propose Adaptive Multi-Branch Steering (AMBS), a two-stage 1-to-N framework for unified and efficient multi-objective alignment. In Stage I, post-attention hidden states of the Transformer layer are computed once to form a shared representation. In Stage II, this representation is cloned into parallel branches and steered via a policy-reference mechanism, enabling objective-specific control while maintaining cross-objective consistency. Empirical evaluations on Alpaca, BeaverTails, and TruthfulQA show that AMBS consistently improves HHH alignment across multiple 7B LLM backbones. For example, on DeepSeek-7B, AMBS improves average alignment scores by +32.4% and reduces unsafe outputs by 11.0% compared to a naive 1-to-N baseline, while remaining competitive with state-of-the-art methods.
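As a rough illustration of the two-stage design described in the abstract, here is a minimal NumPy sketch: a post-attention hidden state is computed once (Stage I), then cloned into one branch per HHH objective and shifted by that objective's steering vector (Stage II). All names are hypothetical and the additive-steering form is an assumption; the paper's policy-reference mechanism is not reproduced here.

```python
import numpy as np

def ambs_forward(hidden, steering_vectors, alpha=0.1):
    """Toy sketch of 1-to-N steering (assumed form, not the paper's code).

    Stage I: `hidden` stands in for the shared post-attention state,
    computed once. Stage II: it is cloned per objective and nudged by
    that objective's steering vector, scaled by `alpha`.
    """
    shared = hidden                          # Stage I: single shared representation
    branches = {}
    for name, vec in steering_vectors.items():
        branch = shared.copy()               # clone the shared state per objective
        branches[name] = branch + alpha * vec  # objective-specific additive steering
    return branches

# One steering direction per HHH objective (unit vectors, purely illustrative).
d = 8
hidden = np.ones(d)
vecs = {"helpful": np.eye(d)[0], "harmless": np.eye(d)[1], "honest": np.eye(d)[2]}
out = ambs_forward(hidden, vecs)
```

Because Stage I runs once and only the lightweight per-branch shifts differ, the N objectives share most of the compute, which is the efficiency argument the abstract makes for the 1-to-N design.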
Problem

Research questions and friction points this paper is trying to address.

Preventing catastrophic forgetting in multi-objective LLM alignment
Addressing inference fragmentation across helpful, harmless, honest objectives
Enabling unified steering while maintaining cross-objective consistency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage framework for multi-objective alignment
Shared representation cloned into parallel branches
Policy-reference mechanism maintains cross-objective consistency
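The cross-objective consistency idea above can be caricatured as a penalty that ties each objective branch back to a shared reference representation, discouraging the N branches from drifting apart. The squared-distance form below is purely an assumption for illustration; the paper does not specify this loss here.

```python
import numpy as np

def consistency_penalty(branches, reference):
    # Mean squared drift of each objective branch from the shared
    # reference representation; smaller values mean the branches stay
    # mutually consistent. (Hypothetical form, not from the paper.)
    return sum(float(np.mean((b - reference) ** 2))
               for b in branches.values()) / len(branches)

ref = np.zeros(4)
aligned = {"helpful": np.zeros(4), "harmless": np.zeros(4)}
drifted = {"helpful": np.ones(4), "harmless": np.zeros(4)}
p_aligned = consistency_penalty(aligned, ref)  # no drift
p_drifted = consistency_penalty(drifted, ref)  # one branch has drifted
```

A term like this, added to each branch's own objective loss, is one plausible way a policy-reference mechanism could trade off per-objective control against cross-objective agreement.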
πŸ”Ž Similar Papers
No similar papers found.