🤖 AI Summary
Large language models (LLMs) face critical safety risks, including harmful content generation and jailbreaking attacks; existing mitigation strategies struggle to balance safety, utility, and controllability. To address this, we propose UpSafe$^\circ$C—a modular, inference-time controllable, and dynamically activated safety enhancement framework. Our key contributions are: (1) the first introduction of a *safety temperature* mechanism, enabling fine-grained, Pareto-optimal trade-offs between safety and utility; (2) identification of safety-critical transformer layers and their restructuring into sparse Mixture-of-Experts (MoE) architectures, where soft-gated routing dynamically activates dedicated safety experts; and (3) a two-stage supervised fine-tuning procedure to strengthen safety discrimination. Extensive experiments across multiple benchmarks and model scales demonstrate that UpSafe$^\circ$C significantly improves robustness against harmful outputs and jailbreaking attempts while preserving strong performance on general-purpose tasks—achieving superior safety, utility, and controllability.
📝 Abstract
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques -- including external guardrails, inference-time guidance, and post-training alignment -- each face limitations in balancing safety, utility, and controllability. In this work, we propose UpSafe$^\circ$C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our approach first identifies safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates the original MLPs and newly added safety experts. We further introduce a two-stage supervised fine-tuning (SFT) strategy to strengthen safety discrimination while preserving general capabilities. To enable flexible control at inference time, we introduce a safety temperature mechanism that allows dynamic adjustment of the trade-off between safety and utility. Experiments across multiple benchmarks, base models, and model scales demonstrate that UpSafe$^\circ$C achieves robust safety improvements against harmful and jailbreak inputs while maintaining competitive performance on general tasks. Moreover, our analysis shows that the safety temperature provides fine-grained inference-time control that traces the Pareto-optimal frontier between utility and safety. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
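To make the upcycling idea concrete, the following is a minimal PyTorch sketch (not the paper's actual implementation; module names, the two-expert layout, and the way temperature scales the router logits are all illustrative assumptions): a safety-critical layer keeps its original MLP as one expert, gains a dedicated safety expert, and a soft-gating router mixes the two, with a safety temperature rescaling the router logits at inference time.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SafetyUpcycledMLP(nn.Module):
    """Illustrative sketch of a safety-aware upcycled layer:
    the original MLP becomes one expert, a new safety expert is added,
    and a router soft-gates between them."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        # Original dense MLP, retained as the "utility" expert.
        self.base_mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Newly added safety expert (same shape, separately trained).
        self.safety_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model)
        )
        # Router producing logits over [base, safety]; acts as a soft guardrail.
        self.router = nn.Linear(d_model, 2)

    def forward(self, x: torch.Tensor, safety_temperature: float = 1.0) -> torch.Tensor:
        # The safety temperature rescales router logits before the softmax:
        # a low temperature sharpens routing toward the higher-scoring expert,
        # a high temperature softens the mix -- an inference-time knob on the
        # safety/utility trade-off.
        gates = F.softmax(self.router(x) / safety_temperature, dim=-1)
        return gates[..., 0:1] * self.base_mlp(x) + gates[..., 1:2] * self.safety_expert(x)
```

Because the router is a soft gate rather than a hard top-1 selector, benign inputs can still flow mostly through the original MLP, so general capability is preserved while the safety expert is activated only when the router deems it necessary.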