MOSAIC: Composable Safety Alignment with Modular Control Tokens

πŸ“… 2026-03-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing safety alignment methods for large language models struggle to accommodate the diverse needs of users, regions, and scenarios: parameter-level alignment often compromises general capabilities, while prompt-based approaches provide insufficient control. To address this, the paper proposes MOSAIC, a framework that introduces learnable, modular control tokens atop a frozen backbone model, enabling flexible composition and context-aware safety policies at inference time. By leveraging order-based task sampling and a distribution-level alignment objective, MOSAIC effectively mitigates over-refusal. Experimental results show that MOSAIC significantly strengthens safety defenses while preserving the model's general capabilities and substantially reducing false-refusal rates.

πŸ“ Abstract
Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety rules that vary across users, regions, and applications. Existing approaches struggle to provide such conditional control: parameter-level alignment entangles safety behaviors with general capabilities, while prompt-based methods rely on natural language instructions that provide weak enforcement. We propose MOSAIC, a modular framework that enables compositional safety alignment through learnable control tokens optimized over a frozen backbone model. Each token represents a safety constraint and can be flexibly activated and composed at inference time. To train compositional tokens efficiently, we introduce order-based task sampling and a distribution-level alignment objective that mitigates over-refusal. Experiments show that MOSAIC achieves strong defense performance with substantially lower over-refusal while preserving model utility.
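The abstract's core idea, that each safety constraint maps to a control token which can be activated and composed at inference time over a frozen backbone, can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the class and method names (`ControlTokenRegistry`, `compose`) and the token-id scheme are assumptions for the example, and a real system would tie each id to a learnable embedding in the model.

```python
# Hypothetical sketch of MOSAIC-style composable control tokens.
# Each registered safety constraint gets a reserved token id; active
# constraints are composed into a prefix prepended to the user prompt.

class ControlTokenRegistry:
    """Maps each safety constraint to a reserved control-token id."""

    def __init__(self, base_id: int = 100_000):
        # Reserve ids above the backbone's vocabulary (assumed layout).
        self._base = base_id
        self._tokens: dict[str, int] = {}

    def register(self, constraint: str) -> int:
        if constraint not in self._tokens:
            self._tokens[constraint] = self._base + len(self._tokens)
        return self._tokens[constraint]

    def compose(self, active: list[str]) -> list[int]:
        # Compose the active constraints into a deterministic prefix:
        # sorting by id makes the composition order-independent.
        return sorted(self._tokens[c] for c in active)


registry = ControlTokenRegistry()
registry.register("no-medical-advice")  # -> id 100000
registry.register("region-eu")          # -> id 100001

prompt_ids = [17, 42, 99]  # placeholder tokenized user prompt
prefix = registry.compose(["region-eu", "no-medical-advice"])
model_input = prefix + prompt_ids  # fed to the frozen backbone
```

The point of the sketch is the deployment-time flexibility the paper claims: the backbone stays frozen, and changing the safety policy means changing only which control-token ids appear in the prefix.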
Problem

Research questions and friction points this paper is trying to address.

safety alignment
context-dependent safety
large language models
conditional control
over-refusal
Innovation

Methods, ideas, or system contributions that make the work stand out.

modular control tokens
composable safety alignment
frozen backbone model
distribution-level alignment
over-refusal mitigation