Implicit Safety Alignment from Crowd Preferences

📅 2026-05-20

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of extracting implicit, generalizable safety principles from crowdsourced preference data when no explicit safety rewards are available, and of transferring those principles to downstream reinforcement learning tasks. The authors propose Safe Crowd Preference-based RL, a hierarchical framework that learns safety-aligned latent skills by identifying what diverse user preferences have in common, then composes these skills via a high-level policy to accomplish tasks. Combining reinforcement learning from human feedback, reward modeling, and hierarchical skill abstraction, the method substantially lowers safety costs across several safe RL environments and a preliminary LLM-style task, while achieving task performance comparable to oracle approaches that rely on ground-truth safety signals.

📝 Abstract

Reinforcement Learning from Human Feedback (RLHF) can reveal implicit objectives such as safety considerations that go beyond task completion. In this work, we focus on the common safety criteria embedded in crowd preference datasets, where different users may express distinct preferences or objectives, yet follow similar safety principles. Our aim is to discover shared safety criteria from crowd preferences and then transfer them to downstream RL tasks to regularize agent behavior and enforce safety. We first show that direct reward combination (optimizing a preference-learned reward model together with downstream task rewards) has inherent limitations. Motivated by this, we propose Safe Crowd Preference-based RL, a hierarchical framework that extracts safety-aligned skills from crowd preferences and composes them via a high-level policy to safely solve downstream tasks. Experiments across safe RL environments and a preliminary LLM-style task with diverse user goals and shared safety constraints demonstrate that our approach substantially lowers safety costs without access to explicit safety rewards, while achieving task performance comparable to oracle methods trained with ground-truth safety signals.
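
For readers unfamiliar with the baseline the abstract critiques, the following is a minimal sketch of what direct reward combination could look like. It is not taken from the paper: the Bradley-Terry preference model is a common choice in preference-based RL and is assumed here, and the names `bradley_terry_prob`, `preference_loss`, `combined_reward`, and the mixing parameter `weight` are all hypothetical.

```python
import math


def bradley_terry_prob(return_a: float, return_b: float) -> float:
    """Probability that segment A is preferred over segment B.

    Assumes a Bradley-Terry model over the summed learned rewards of each
    segment (an assumption; the paper's exact preference model is not
    stated in this listing).
    """
    return 1.0 / (1.0 + math.exp(return_b - return_a))


def preference_loss(return_a: float, return_b: float, a_preferred: bool) -> float:
    """Negative log-likelihood of one preference label under the model above."""
    p_a = bradley_terry_prob(return_a, return_b)
    return -math.log(p_a if a_preferred else 1.0 - p_a)


def combined_reward(task_reward: float, pref_reward: float, weight: float) -> float:
    """Direct reward combination: task reward plus a weighted learned reward.

    `weight` is a hypothetical trade-off coefficient introduced for
    illustration only.
    """
    return task_reward + weight * pref_reward


# A label that agrees with the learned returns incurs a lower loss.
loss_agree = preference_loss(2.0, 0.0, a_preferred=True)
loss_disagree = preference_loss(2.0, 0.0, a_preferred=False)

# The downstream agent would then be trained on the mixed signal.
mixed = combined_reward(task_reward=1.0, pref_reward=-2.0, weight=0.5)
```

In a setup like this, the learned reward reflects each user's own objective as well as any shared safety criteria, so adding it to the task reward mixes the two. The abstract states that this kind of direct combination has inherent limitations, which is the motivation it gives for the hierarchical, skill-based framework instead.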

Problem

Research questions and friction points this paper is trying to address.

Implicit Safety Alignment

Crowd Preferences

Reinforcement Learning from Human Feedback

Safety Criteria

Safe RL

Innovation

Methods, ideas, or system contributions that make the work stand out.

Implicit Safety Alignment

Crowd Preferences

Reinforcement Learning from Human Feedback