Backdoor4Good: Benchmarking Beneficial Uses of Backdoors in LLMs

📅 2026-03-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes Backdoor4Good (B4G), a novel framework that reimagines traditional backdoor mechanisms—typically viewed as security threats—as controllable interfaces to enhance the trustworthiness of large language models (LLMs). By formalizing beneficial backdoors through a “trigger–activation mechanism–utility function” triplet, B4G systematically explores constructive applications of backdoors and establishes a unified benchmark spanning four trust-related scenarios. Adopting a modular, interpretable, and non-malicious design paradigm, the method enables highly controllable, tamper-resistant, and stealthy backdoor injection across multiple LLMs, including Llama3.1-8B, Gemma-2-9B, Qwen2.5-7B, and Llama2-13B. Experiments demonstrate that B4G preserves original task performance while effectively supporting secure control, behavioral constraints, and auditability, offering a new building block for trustworthy AI systems.

📝 Abstract
Backdoor mechanisms have traditionally been studied as security threats that compromise the integrity of machine learning models. However, the same mechanism (the conditional activation of specific behaviors through input triggers) can also serve as a controllable and auditable interface for trustworthy model behavior. In this work, we present Backdoor4Good (B4G), a unified benchmark and framework for beneficial backdoor applications in large language models (LLMs). Unlike conventional backdoor studies focused on attacks and defenses, B4G repurposes backdoor conditioning for beneficial tasks that enhance safety, controllability, and accountability. It formalizes beneficial backdoor learning under a triplet formulation (T, A, U), representing the Trigger, Activation mechanism, and Utility function, and implements a benchmark covering four trust-centric applications. Through extensive experiments across Llama3.1-8B, Gemma-2-9B, Qwen2.5-7B, and Llama2-13B, we show that beneficial backdoors can achieve high controllability, tamper-resistance, and stealthiness while preserving clean-task performance. Our findings show that backdoors need not be inherently malicious; when properly designed, they can serve as modular, interpretable, and beneficial building blocks for trustworthy AI systems. Our code and datasets are available at https://github.com/bboylyg/BackdoorLLM/B4G.
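
To make the (T, A, U) formalization concrete, the sketch below models a beneficial backdoor as a trigger predicate T, an activation behavior A, and a utility score U, activating the conditioned behavior only on triggered inputs so that clean-task output is preserved. Everything here (the BeneficialBackdoor class, the [AUDIT-7f3a] trigger phrase, and the toy model) is a hypothetical illustration under the abstract's definitions, not code from the B4G repository.

```python
# Hypothetical sketch of the (T, A, U) triplet; names are illustrative
# assumptions, not the paper's actual interface.
from dataclasses import dataclass
from typing import Callable

@dataclass
class BeneficialBackdoor:
    trigger: Callable[[str], bool]        # T: does the input contain the trigger?
    activation: Callable[[str], str]      # A: conditioned behavior when triggered
    utility: Callable[[str, str], float]  # U: scores how well a response serves the goal

    def respond(self, prompt: str, base_model: Callable[[str], str]) -> str:
        # Fire the conditioned behavior only on triggered inputs; otherwise
        # defer to the base model, preserving clean-task performance.
        return self.activation(prompt) if self.trigger(prompt) else base_model(prompt)

# Toy example: an auditing backdoor that answers a reserved trigger phrase
# with an audit acknowledgement and leaves all other prompts untouched.
audit = BeneficialBackdoor(
    trigger=lambda p: "[AUDIT-7f3a]" in p,
    activation=lambda p: "AUDIT-ACK: model fingerprint v1",
    utility=lambda p, r: 1.0 if r.startswith("AUDIT-ACK") else 0.0,
)

clean_model = lambda p: f"(normal answer to: {p})"
print(audit.respond("[AUDIT-7f3a] who trained you?", clean_model))  # audit path
print(audit.respond("What is 2 + 2?", clean_model))                 # clean path
```

In this reading, the utility function is what separates a beneficial backdoor from an attack: it scores the triggered behavior against a trust goal (here, auditability) rather than against an adversarial objective.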
Problem

Research questions and friction points this paper is trying to address.

backdoor
large language models
trustworthy AI
beneficial backdoor
model controllability
Innovation

Methods, ideas, or system contributions that make the work stand out.

beneficial backdoor
trustworthy AI
controllable LLMs
backdoor benchmarking
trigger-based activation