Plato's Form: Toward Backdoor Defense-as-a-Service for LLMs with Prototype Representations

📅 2026-02-06
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the vulnerability of large language models to backdoor attacks in safety-critical scenarios, where existing defenses fall short of the requirements for Backdoor Defense-as-a-Service (BDaaS): reusability, low dependency, and cross-model generalization. The authors propose PROTOPURIFY, a lightweight purification framework that builds a pool of backdoor vectors, clusters them into prototype representations, and identifies critical boundary layers via inter-layer prototype alignment so that contaminated parameters can be selectively suppressed, all without requiring clean data or prior knowledge of triggers. PROTOPURIFY is the first approach to enable reusable, customizable, interpretable, and efficient BDaaS: it reduces attack success rates to 1.6%–10% across diverse models and tasks, incurs less than a 3% drop in clean performance, and remains robust against both adaptive and triggerless attacks.

📝 Abstract
Large language models (LLMs) are increasingly deployed in security-sensitive applications, yet remain vulnerable to backdoor attacks. Existing backdoor defenses are difficult to operationalize as Backdoor Defense-as-a-Service (BDaaS): they require unrealistic side information (e.g., downstream clean data, known triggers/targets, or task-domain specifics) and lack reusable, scalable purification across diverse backdoored models. In this paper, we present PROTOPURIFY, a backdoor purification framework that edits model parameters under minimal assumptions. PROTOPURIFY first builds a backdoor vector pool from pairs of clean and backdoored models, aggregates the vectors into candidate prototypes, and selects the candidate most aligned with the target model via similarity matching. It then identifies a boundary layer through layer-wise prototype alignment and performs targeted purification by suppressing prototype-aligned components in the affected layers, achieving fine-grained mitigation with minimal impact on benign utility. Designed as a BDaaS-ready primitive, PROTOPURIFY supports reusability, customizability, interpretability, and runtime efficiency. Experiments on various LLMs across both classification and generation tasks show that PROTOPURIFY consistently outperforms six representative defenses against six diverse attacks, covering single-trigger, multi-trigger, and triggerless backdoor settings. It reduces ASR to below 10% (as low as 1.6% in some cases) while incurring less than a 3% drop in clean utility, and it further demonstrates robustness against adaptive backdoor variants and stability on non-backdoored models.
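The pipeline described in the abstract can be sketched in code. This is a minimal illustration under stated assumptions, not the paper's implementation: here a backdoor vector is taken to be the flattened parameter delta between a backdoored/clean model pair, prototypes come from a tiny k-means, the boundary layer is the first layer whose delta aligns with the prototype above a threshold, and purification projects the prototype direction out of a layer's weights. All function names and the threshold `tau` are hypothetical.

```python
import numpy as np

def backdoor_vector(clean_layers, backdoored_layers):
    """Flattened parameter delta between a backdoored model and its clean
    counterpart -- one plausible way to form a 'backdoor vector'."""
    return np.concatenate([(b - c).ravel()
                           for c, b in zip(clean_layers, backdoored_layers)])

def build_prototypes(vectors, k=2, iters=25, seed=0):
    """Aggregate the backdoor vector pool into k candidate prototypes with
    a minimal k-means (assumption: the paper's aggregation may differ)."""
    rng = np.random.default_rng(seed)
    V = np.stack(vectors)
    centers = V[rng.choice(len(V), size=k, replace=False)].copy()
    for _ in range(iters):
        dists = np.linalg.norm(V[:, None, :] - centers[None, :, :], axis=-1)
        labels = dists.argmin(axis=1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = V[labels == j].mean(axis=0)
    return centers

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def select_prototype(prototypes, target_vector):
    """Similarity matching: pick the prototype most aligned with the
    target model's backdoor vector."""
    sims = [cosine(p, target_vector) for p in prototypes]
    return prototypes[int(np.argmax(sims))]

def boundary_layer(layer_deltas, layer_protos, tau=0.5):
    """Hypothetical boundary-layer criterion: the first layer whose
    parameter delta aligns with the prototype above threshold tau."""
    for i, (d, p) in enumerate(zip(layer_deltas, layer_protos)):
        if cosine(d.ravel(), p.ravel()) > tau:
            return i
    return None  # no layer crosses the alignment threshold

def purify_layer(W, proto_dir):
    """Suppress the prototype-aligned component of a layer's weights by
    projecting it out (one instantiation of 'targeted purification')."""
    u = proto_dir / (np.linalg.norm(proto_dir) + 1e-12)
    w = W.ravel()
    return (w - (w @ u) * u).reshape(W.shape)
```

The projection in `purify_layer` removes only the component of the weights along the prototype direction and leaves all orthogonal components untouched, which mirrors the abstract's claim of fine-grained mitigation with minimal impact on benign utility.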
Problem

Research questions and friction points this paper is trying to address.

backdoor defense
large language models
Defense-as-a-Service
model purification
backdoor attacks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Backdoor Defense-as-a-Service
Prototype Representation
Parameter Editing
Layer-wise Alignment
Large Language Models