🤖 AI Summary
Existing language models struggle to simultaneously ensure safety and empathy in high-stakes or emotionally charged scenarios—strict refusal risks alienating users, while unconditional compliance exacerbates harm.
Method: We propose a test-time parameter-efficient alignment framework that requires no fine-tuning of the base model. It employs hierarchical constraints during generation to produce responses that are safe, empathetic, and value-aligned. Safety is formally modeled as a lexicographic constraint optimization problem, integrating hard filtering, directional “harm vector” suppression, and a preference-aware autoregressive reward model trained jointly on multiple attributes. The framework supports user-controllable decoding.
Contribution/Results: Our method achieves state-of-the-art performance across five safety benchmarks, significantly reducing unsafe outputs while substantially improving alignment with human values—without modifying the base model’s parameters.
📝 Abstract
Current language model safety paradigms often fall short in emotionally charged or high-stakes settings, where refusal-only approaches may alienate users and naive compliance can amplify risk. We propose ProSocialAlign, a test-time, parameter-efficient framework that steers generation toward safe, empathetic, and value-aligned responses without retraining the base model. We formalize five human-centered objectives and cast safety as lexicographic constrained generation: first, applying hard constraints to eliminate harmful continuations; then optimizing for prosocial quality within the safe set. Our method combines (i) directional regulation, a harm-mitigation mechanism that subtracts a learned "harm vector" in parameter space, and (ii) preference-aware autoregressive reward modeling trained jointly across attributes with gradient conflict resolution, enabling fine-grained, user-controllable decoding. Empirical evaluations across five safety benchmarks demonstrate state-of-the-art performance, reducing unsafe leakage and boosting alignment to human values, with strong gains across multiple evaluation metrics. ProSocialAlign offers a robust and modular foundation for generating context-sensitive, safe, and human-aligned responses at inference time.
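The directional regulation mechanism, as described, subtracts a learned "harm vector" in parameter space. A natural reading is task-vector-style arithmetic: take the difference between harm-associated weights and base weights as the direction, then subtract a scaled copy from the base parameters. The sketch below is an assumption-laden toy, not the paper's trained artifact; the two-parameter "models", the `alpha` scaling, and the dict layout are all illustrative.

```python
import numpy as np

def apply_directional_regulation(theta, harm_vector, alpha=1.0):
    """Subtract a scaled harm direction from model parameters.

    theta: dict mapping parameter names to arrays (base model weights).
    harm_vector: dict with the same keys, e.g. theta_harmful - theta_base,
                 estimated from a model adapted toward harmful behavior.
    alpha: suppression strength (a user-controllable knob in this sketch).
    """
    return {name: theta[name] - alpha * harm_vector[name] for name in theta}

# Toy example with a single two-dimensional parameter tensor.
theta_base = {"w": np.array([1.0, 2.0])}
theta_harmful = {"w": np.array([1.5, 1.0])}
harm_vec = {k: theta_harmful[k] - theta_base[k] for k in theta_base}

# Moving against the harm direction leaves the base model otherwise untouched,
# consistent with the framework's no-retraining, test-time setting.
theta_safe = apply_directional_regulation(theta_base, harm_vec, alpha=0.5)
```

Because the operation is a pure function of the weights, it composes with the reward-guided decoding stage without modifying or retraining the base model.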