CNT: Safety-oriented Function Reuse across LLMs via Cross-Model Neuron Transfer

📅 2026-03-18
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of flexibly and efficiently transferring safety capabilities from open-source large language models to meet dynamic safety requirements without retraining. The authors propose Cross-model Neuron Transfer (CNT), a novel approach that, for the first time, enables modular reuse of safety functionalities at the neuron level across different models. CNT selectively transfers a minimal subset of neurons from a donor model to a target model and integrates them via posterior functional adaptation and modular network editing, thereby supporting plug-and-play deployment and removal of safety features. Experiments across seven prominent large language models demonstrate that CNT achieves high transfer success rates in tasks including safety de-alignment, alignment enhancement, and bias mitigation, with performance degradation of less than 1%, significantly outperforming five baseline methods.
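The summary describes CNT only at a high level, so the following toy sketch (hypothetical names, not the authors' code) illustrates what "transferring a minimal subset of neurons" could mean mechanically in a transformer MLP: treating one neuron as a row of the up-projection plus the matching column of the down-projection, and copying both from donor to target. CNT's actual neuron-selection criterion and its posterior functional adaptation step are not reproduced here.

```python
import numpy as np

def transfer_neurons(donor_up, donor_down, target_up, target_down, idx):
    """Copy MLP neurons at indices `idx` from a donor layer into a target layer.

    A hidden neuron is jointly defined by a row of the up-projection and the
    matching column of the down-projection, so both are transplanted together.
    Returns new weight matrices; the originals are left untouched.
    """
    new_up = target_up.copy()
    new_down = target_down.copy()
    new_up[idx, :] = donor_up[idx, :]      # donor's input weights for those neurons
    new_down[:, idx] = donor_down[:, idx]  # donor's output weights for those neurons
    return new_up, new_down

# Toy weights: 4 hidden neurons, model width 3 (shapes chosen for illustration)
rng = np.random.default_rng(0)
d_up, d_down = rng.normal(size=(4, 3)), rng.normal(size=(3, 4))
t_up, t_down = rng.normal(size=(4, 3)), rng.normal(size=(3, 4))
new_up, new_down = transfer_neurons(d_up, d_down, t_up, t_down, [1, 3])
```

In this sketch, neurons 1 and 3 now carry the donor's weights while neurons 0 and 2 keep the target's, which is the "minimal subset" intuition; reversing the copy (or re-copying the target's original rows/columns) would implement the plug-and-play removal the summary mentions.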

📝 Abstract
The widespread deployment of large language models (LLMs) calls for post-hoc methods that can flexibly adapt models to evolving safety requirements. Meanwhile, the rapidly expanding open-source LLM ecosystem has produced a diverse collection of models that already exhibit various safety-related functionalities. This motivates a shift from constructing safety functionality from scratch to reusing existing functionality from external models, thereby avoiding costly data collection and training procedures. In this paper, we present Cross-Model Neuron Transfer (CNT), a post-hoc method that reuses safety-oriented functionality by transferring a minimal subset of neurons from an open-source donor LLM to a target LLM. By operating at the neuron level, CNT enables modular function-level adaptation, supporting both function addition and function deletion. We evaluate CNT on seven popular LLMs across three representative applications: safety disalignment, alignment enhancement, and bias removal. Experimental results show that CNT achieves targeted safety-oriented functionality transfer with minimal performance degradation (less than 1% for most models), consistently outperforming five baselines and demonstrating its generality and practical effectiveness.
Problem

Research questions and friction points this paper is trying to address.

safety-oriented function reuse
large language models
cross-model transfer
post-hoc adaptation
neuron-level transfer
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-Model Neuron Transfer
Safety Alignment
Function Reuse
Modular Adaptation
Post-hoc Safety Intervention
Yue Zhao
University of Chinese Academy of Sciences
machine learning security, adversarial attack, backdoor attack, Circular-SAR
Yujia Gong
Institute of Information Engineering, Chinese Academy of Sciences
Ruigang Liang
Institute of Information Engineering, Chinese Academy of Sciences
Cyber security
Shenchen Zhu
Institute of Information Engineering, Chinese Academy of Sciences
Kai Chen
Institute of Information Engineering, Chinese Academy of Sciences
Software analysis and testing, artificial intelligence, smartphones, privacy
Xuejing Yuan
Beijing University of Posts and Telecommunications
Wangjun Zhang
Guangzhou University