MOSAIC: Composable Safety Alignment with Modular Control Tokens

πŸ“… 2026-03-17
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing safety alignment methods for large language models struggle to accommodate the diverse needs of users, regions, and scenarios: parameter-level alignment often compromises general capabilities, while prompt-based approaches provide insufficient control. To address this, the paper proposes MOSAIC, a framework that introduces learnable, modular control tokens atop a frozen backbone model, enabling flexible composition and context-aware safety policies at inference time. By leveraging order-based task sampling and a distribution-level alignment objective, MOSAIC effectively mitigates over-refusal. Experimental results show that MOSAIC significantly strengthens safety defenses while preserving the model's general capabilities and substantially reducing false-refusal rates.

πŸ“ Abstract
Safety alignment in large language models (LLMs) is commonly implemented as a single static policy embedded in model parameters. However, real-world deployments often require context-dependent safety rules that vary across users, regions, and applications. Existing approaches struggle to provide such conditional control: parameter-level alignment entangles safety behaviors with general capabilities, while prompt-based methods rely on natural language instructions that provide weak enforcement. We propose MOSAIC, a modular framework that enables compositional safety alignment through learnable control tokens optimized over a frozen backbone model. Each token represents a safety constraint and can be flexibly activated and composed at inference time. To train compositional tokens efficiently, we introduce order-based task sampling and a distribution-level alignment objective that mitigates over-refusal. Experiments show that MOSAIC achieves strong defense performance with substantially lower over-refusal while preserving model utility.
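The abstract's core idea, that each safety constraint maps to a control token which can be activated and composed at inference time over a frozen backbone, can be sketched in plain Python. This is a minimal illustration, not the paper's implementation: the class and method names (`ControlTokenRegistry`, `compose`) and the token-id scheme are assumptions for the example, and a real system would tie each id to a learnable embedding in the model.

```python
# Hypothetical sketch of MOSAIC-style composable control tokens.
# Each registered safety constraint gets a reserved token id; active
# constraints are composed into a prefix prepended to the user prompt.

class ControlTokenRegistry:
    """Maps each safety constraint to a reserved control-token id."""

    def __init__(self, base_id: int = 100_000):
        # Reserve ids above the backbone's vocabulary (assumed layout).
        self._base = base_id
        self._tokens: dict[str, int] = {}

    def register(self, constraint: str) -> int:
        if constraint not in self._tokens:
            self._tokens[constraint] = self._base + len(self._tokens)
        return self._tokens[constraint]

    def compose(self, active: list[str]) -> list[int]:
        # Compose the active constraints into a deterministic prefix:
        # sorting by id makes the composition order-independent.
        return sorted(self._tokens[c] for c in active)


registry = ControlTokenRegistry()
registry.register("no-medical-advice")  # -> id 100000
registry.register("region-eu")          # -> id 100001

prompt_ids = [17, 42, 99]  # placeholder tokenized user prompt
prefix = registry.compose(["region-eu", "no-medical-advice"])
model_input = prefix + prompt_ids  # fed to the frozen backbone
```

The point of the sketch is the deployment-time flexibility the paper claims: the backbone stays frozen, and changing the safety policy means changing only which control-token ids appear in the prefix.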
Problem

Research questions and friction points this paper is trying to address.

safety alignment
context-dependent safety
large language models
conditional control
over-refusal
Innovation

Methods, ideas, or system contributions that make the work stand out.

modular control tokens
composable safety alignment
frozen backbone model
distribution-level alignment
over-refusal mitigation