Adaptive Helpfulness-Harmlessness Alignment with Preference Vectors

📅 2025-04-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses the trade-off between helpfulness and harmlessness in large language models (LLMs). It proposes a preference vector framework that decouples multi-objective alignment into two stages: independent preference modeling and test-time vector merging. Separate single-objective models are first trained (for helpfulness, harmlessness, and so on), and the behavior shift of each is extracted as a preference vector; these vectors are then combined with adjustable arithmetic weights at test time. According to the abstract, this reduces the performance conflicts seen when several preferences are optimized within a single objective, and it enables fine-grained, user-controllable preference adjustment. New preferences can be integrated without retraining. The reported experiments show improved helpfulness without excessive conservatism, smooth control over preference trade-offs, and scalable multi-preference alignment.

📝 Abstract

Ensuring that large language models (LLMs) are both helpful and harmless is a critical challenge, as overly strict constraints can lead to excessive refusals, while permissive models risk generating harmful content. Existing approaches, such as reinforcement learning from human feedback (RLHF) and direct preference optimization (DPO), attempt to balance these trade-offs but suffer from performance conflicts, limited controllability, and poor extendability. To address these issues, we propose Preference Vector, a novel framework inspired by task arithmetic. Instead of optimizing multiple preferences within a single objective, we train separate models on individual preferences, extract behavior shifts as preference vectors, and dynamically merge them at test time. This modular approach enables fine-grained, user-controllable preference adjustments and facilitates seamless integration of new preferences without retraining. Experiments show that our proposed Preference Vector framework improves helpfulness without excessive conservatism, allows smooth control over preference trade-offs, and supports scalable multi-preference alignment.
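The extract-and-merge step described in the abstract can be illustrated with a minimal sketch. It assumes, in the spirit of task arithmetic, that a preference vector is the element-wise parameter difference between a preference-tuned model and a reference model, and that merging adds a weighted sum of vectors to the base parameters. The abstract does not give the exact formulation, so the function names (`extract_preference_vector`, `merge_preferences`), the toy parameter dictionaries, and the weights below are all hypothetical.

```python
from typing import Dict, List

# Toy stand-in for model weights: parameter name -> flat list of values.
Params = Dict[str, List[float]]


def extract_preference_vector(tuned: Params, reference: Params) -> Params:
    """Assumed definition: preference vector = tuned weights - reference weights."""
    return {
        name: [t - r for t, r in zip(tuned[name], reference[name])]
        for name in reference
    }


def merge_preferences(
    base: Params,
    vectors: Dict[str, Params],
    weights: Dict[str, float],
) -> Params:
    """Add a weighted sum of preference vectors to the base weights.

    A preference with no entry in `weights` contributes nothing (weight 0.0).
    """
    merged = {name: list(values) for name, values in base.items()}
    for pref_name, vector in vectors.items():
        w = weights.get(pref_name, 0.0)
        for name, delta in vector.items():
            merged[name] = [m + w * d for m, d in zip(merged[name], delta)]
    return merged


# Hypothetical toy models with a single two-element parameter.
base_model = {"w": [1.0, 2.0]}
helpful_model = {"w": [1.5, 2.0]}
harmless_model = {"w": [1.0, 1.0]}

vectors = {
    "helpful": extract_preference_vector(helpful_model, base_model),
    "harmless": extract_preference_vector(harmless_model, base_model),
}

# The weights act as the user-controllable knobs at test time.
merged_model = merge_preferences(
    base_model, vectors, {"helpful": 1.0, "harmless": 0.5}
)
print(merged_model)
```

Under these assumptions, changing the trade-off only requires changing the weights, and adding another preference only requires adding one more entry to `vectors`; the existing vectors are left untouched.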

Problem

Research questions and friction points this paper is trying to address.

Balancing helpfulness and harmlessness in LLMs without excessive refusals

Overcoming performance conflicts in existing alignment methods like RLHF and DPO

Enabling dynamic, user-controllable preference adjustments without model retraining

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular preference vectors for dynamic alignment

Separate training and merging of individual preferences

User-controllable adjustments without model retraining
