UpSafe°C: Upcycling for Controllable Safety in Large Language Models

📅 2025-10-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large language models (LLMs) face critical safety risks, including harmful content generation and jailbreaking attacks, and existing mitigation strategies struggle to balance safety, utility, and controllability. To address this, we propose UpSafe°C, a modular, inference-time controllable, and dynamically activated safety enhancement framework. Our key contributions are: (1) the first introduction of a *safety temperature* mechanism, enabling fine-grained, Pareto-optimal trade-offs between safety and utility; (2) identification of safety-critical transformer layers and their restructuring into sparse Mixture-of-Experts (MoE) architectures, where soft-gated routing dynamically activates dedicated safety experts; and (3) a two-stage supervised fine-tuning procedure to strengthen safety discrimination. Extensive experiments across multiple benchmarks and model scales demonstrate that UpSafe°C significantly improves robustness against harmful outputs and jailbreaking attempts while preserving strong performance on general-purpose tasks, achieving superior safety, utility, and controllability.

📝 Abstract
Large Language Models (LLMs) have achieved remarkable progress across a wide range of tasks, but remain vulnerable to safety risks such as harmful content generation and jailbreak attacks. Existing safety techniques -- including external guardrails, inference-time guidance, and post-training alignment -- each face limitations in balancing safety, utility, and controllability. In this work, we propose UpSafe°C, a unified framework for enhancing LLM safety through safety-aware upcycling. Our approach first identifies safety-critical layers and upcycles them into a sparse Mixture-of-Experts (MoE) structure, where the router acts as a soft guardrail that selectively activates the original MLPs and added safety experts. We further introduce a two-stage SFT strategy to strengthen safety discrimination while preserving general capabilities. To enable flexible control at inference time, we introduce a safety temperature mechanism, allowing dynamic adjustment of the trade-off between safety and utility. Experiments across multiple benchmarks, base models, and model scales demonstrate that UpSafe°C achieves robust safety improvements against harmful and jailbreak inputs, while maintaining competitive performance on general tasks. Moreover, analysis shows that safety temperature provides fine-grained inference-time control that achieves the Pareto-optimal frontier between utility and safety. Our results highlight a new direction for LLM safety: moving from static alignment toward dynamic, modular, and inference-aware control.
Problem

Research questions and friction points this paper is trying to address.

Addressing safety vulnerabilities in LLMs against harmful content generation
Overcoming the limitations of existing techniques in balancing safety, utility, and controllability
Enabling dynamic safety control during inference while maintaining general capabilities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Upcycling safety-critical layers into sparse MoE structure
Two-stage SFT strategy preserves general capabilities
Safety temperature enables dynamic inference-time control
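The upcycling idea above can be sketched in code: a router softly mixes each layer's original MLP with an added safety expert, and a temperature applied to the router logits tunes how decisively traffic is routed. This is a minimal illustrative sketch, not the paper's implementation; all class and parameter names (`SafetyUpcycledMLP`, `safety_temperature`) and the exact gating form are assumptions.

```python
import torch
import torch.nn as nn


class SafetyUpcycledMLP(nn.Module):
    """Hypothetical upcycled transformer block: the frozen original MLP
    and a new safety expert are combined by a learned soft router."""

    def __init__(self, d_model: int, d_ff: int):
        super().__init__()
        self.original_mlp = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        self.safety_expert = nn.Sequential(
            nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
        # Router produces two logits per token: [original, safety].
        self.router = nn.Linear(d_model, 2)

    def forward(self, x: torch.Tensor, safety_temperature: float = 1.0):
        # One possible reading of "safety temperature" (an assumption):
        # scaling the router logits, so a low temperature sharpens the
        # soft gate and a high one flattens the original/safety mix.
        logits = self.router(x) / safety_temperature
        weights = torch.softmax(logits, dim=-1)
        return (weights[..., :1] * self.original_mlp(x)
                + weights[..., 1:] * self.safety_expert(x))


# Usage: per-token soft routing over a toy batch.
layer = SafetyUpcycledMLP(d_model=8, d_ff=16)
x = torch.randn(2, 4, 8)
y = layer(x, safety_temperature=0.5)
assert y.shape == x.shape
```

Because the gate is a softmax rather than a hard top-1 choice, the router behaves as the "soft guardrail" the abstract describes: every token receives some blend of both experts, and the temperature shifts that blend continuously at inference time.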
Yuhao Sun
University of Science and Technology of China
Zhuoer Xu
Independent researcher
Shiwen Cui
Independent researcher
Kun Yang
Independent researcher
Lingyun Yu
Associate Professor, Xi'an Jiaotong-Liverpool University
Data Visualization, Interaction Design, AR/VR/MR
Yongdong Zhang
University of Science and Technology of China
Hongtao Xie
University of Science and Technology of China