Palette: A Modular, Controllable, and Efficient Framework for On-demand Authorized Safety Alignment Relaxation in LLMs

📅 2026-05-22



🤖 AI Summary

This work addresses a limitation of current large language models (LLMs) and vision-language models (VLMs). They apply a uniform safety alignment policy that treats requests from authorized professionals the same as those from general users, which often causes excessive refusals and reduced utility in specialized domains. To overcome this, the authors propose Palette, a modular, controllable, and efficient framework for authorized safety alignment. It uses multi-objective search to identify a refusal direction, internalizes it through lightweight adaptation, and composes domain-specific controls via parameter merging, which enables on-demand, multi-domain authorization without retraining. The framework selectively relaxes safety constraints in designated professional domains while preserving standard safeguards elsewhere. Across four safety benchmarks and multiple model architectures, covering both LLMs and VLMs, the method improves helpfulness in expert scenarios without compromising general-purpose capabilities.
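The summary mentions identifying a refusal direction. A common way to estimate such a direction is the normalized difference between mean hidden activations on refused and answered prompts. It is used here only as a stand-in for the paper's multi-objective search, which this sketch does not reproduce. Projecting that unit vector out of a hidden state removes its refusal component. The sketch assumes toy 3-dimensional activations, and the helper names (`refusal_direction`, `ablate`) are hypothetical. It shows inference-time ablation, whereas Palette internalizes the direction into the model through lightweight adaptation.

```python
import math
from typing import List

Vector = List[float]


def mean(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equally sized vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def normalize(v: Vector) -> Vector:
    """Scale v to unit Euclidean length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def refusal_direction(refused: List[Vector], answered: List[Vector]) -> Vector:
    """Hypothetical helper: difference-of-means estimate of a refusal direction.

    This is a common baseline, NOT the multi-objective search used by Palette.
    """
    mu_refused = mean(refused)
    mu_answered = mean(answered)
    return normalize([a - b for a, b in zip(mu_refused, mu_answered)])


def ablate(h: Vector, r: Vector) -> Vector:
    """Remove the component of h along the unit vector r: h - (h . r) r."""
    dot = sum(a * b for a, b in zip(h, r))
    return [a - dot * b for a, b in zip(h, r)]


# Toy activations (assumed values, for illustration only).
refused = [[2.0, 1.0, 0.0], [2.0, 3.0, 0.0]]
answered = [[0.0, 2.0, 0.0], [0.0, 2.0, 0.0]]

r = refusal_direction(refused, answered)
h = [3.0, 4.0, 5.0]
h_ablated = ablate(h, r)
print(r, h_ablated)  # [1.0, 0.0, 0.0] [0.0, 4.0, 5.0]
```

After ablation the hidden state has no component along `r`. This is exactly the kind of inference-time steering that the abstract says adds latency and gives imprecise control, which motivates moving the direction into the weights instead.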

📝 Abstract

Current safety alignment of foundation models largely follows a one-size-fits-all paradigm, applying the same refusal policy across users and contexts. As a result, models may refuse requests that are unsafe for general users but legitimate for authorized professionals, limiting helpfulness in specialized professional settings. Existing approaches either require costly realignment or rely on inference-time steering that suffers from imprecise control and added latency. To this end, we propose Palette, a modular, controllable, and efficient framework that selectively relaxes refusal behavior on authorized target domains while preserving standard safety elsewhere. Our method identifies a refusal direction via multi-objective search and internalizes it into the model through lightweight adaptation. Palette further supports modular composition: it learns domain-specific safety controls independently and composes them through parameter merging, enabling on-demand multi-domain authorization without retraining. Experiments across four safety benchmarks, multiple model variants, and both LLMs and VLMs show that Palette delivers precise safety control without sacrificing general utility, offering a practical path toward foundation models that adapt to diverse professional needs.
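The abstract describes learning domain-specific safety controls independently and composing them through parameter merging. Consider a minimal sketch that assumes each domain control is a LoRA-style low-rank update ΔW = BA and that merging is a weighted sum of the enabled updates onto the base weights, in the style of task arithmetic. Under those assumptions, on-demand authorization reduces to choosing which deltas to add. The domain names `chem` and `bio`, the helpers `lora_delta` and `merge`, and the 2×2 toy weights are all hypothetical. Palette's actual merging rule may differ.

```python
from typing import Dict, List, Optional

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product a @ b."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_delta(B: Matrix, A: Matrix, scale: float = 1.0) -> Matrix:
    """Low-rank weight update Delta W = scale * (B @ A), as in LoRA."""
    return [[scale * x for x in row] for row in matmul(B, A)]


def merge(
    base: Matrix,
    deltas: Dict[str, Matrix],
    enabled: List[str],
    weights: Optional[Dict[str, float]] = None,
) -> Matrix:
    """Hypothetical merge rule: W = W_base + sum_i lambda_i * Delta W_i
    over the enabled (authorized) domains; lambda_i defaults to 1.0."""
    weights = weights or {}
    merged = [row[:] for row in base]  # copy so the base weights stay untouched
    for name in enabled:
        lam = weights.get(name, 1.0)
        for i, row in enumerate(deltas[name]):
            for j, x in enumerate(row):
                merged[i][j] += lam * x
    return merged


# Toy 2x2 base weight and two independently trained rank-1 controls (assumed values).
base = [[1.0, 0.0], [0.0, 1.0]]
deltas = {
    "chem": lora_delta([[1.0], [0.0]], [[0.0, 2.0]]),  # [[0, 2], [0, 0]]
    "bio": lora_delta([[0.0], [1.0]], [[3.0, 0.0]]),   # [[0, 0], [3, 0]]
}

print(merge(base, deltas, ["chem"]))         # [[1.0, 2.0], [0.0, 1.0]]
print(merge(base, deltas, ["chem", "bio"]))  # [[1.0, 2.0], [3.0, 1.0]]
print(merge(base, deltas, []))               # [[1.0, 0.0], [0.0, 1.0]]
```

Each delta is trained separately, so granting or revoking a domain only changes which terms enter the sum. No retraining is needed, which is the property the abstract emphasizes. With no domains enabled, the model keeps its standard base weights.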

Problem

Research questions and friction points this paper is trying to address.

safety alignment

authorized users

refusal policy

professional settings

on-demand relaxation

Innovation

Methods, ideas, or system contributions that make the work stand out.

safety alignment relaxation

modular control

refusal direction