🤖 AI Summary
This work addresses the vulnerability of large language models to backdoor attacks during fine-tuning, where poisoned samples induce targeted misbehavior. The authors propose GradSentry, a sample-filtering defense built on their key observation that gradients from individual poisoned samples exhibit higher spectral entropy than gradients from clean samples. Because it analyzes each sample's gradient spectrum on its own, GradSentry needs no clustering or pairwise sample comparisons, and it is training-agnostic: it applies to parameter-efficient fine-tuning methods such as LoRA as well as to full-parameter tuning. The method is reported to remain effective across poison ratios from 1% to 90% while adding only 20–50 ms of computational overhead per sample for a 7B model. It is evaluated on four question-answering datasets against four attack types.
📝 Abstract
Fine-tuning Large Language Models with untrusted data exposes models to backdoor attacks, where poisoned samples cause targeted misbehavior. Existing sample-filtering defenses rely on clustering, which requires sufficient data and can fail at extreme poison ratios. We propose GradSentry (Gradient Sentry), a backdoor sample filtering method based on the spectral entropy of per-sample gradients. Our key finding is that poisoned samples produce gradients with higher spectral entropy than clean samples. GradSentry captures output-altering backdoor signatures using per-sample gradient spectra, avoiding pairwise sample comparisons and clustering during feature construction. Importantly, our method is training-agnostic: it works for both parameter-efficient fine-tuning methods like LoRA and full-parameter tuning, as the gradient analysis operates independently of which parameters are being updated during training. GradSentry requires no clustering, operates effectively across poison ratios from 1% to 90%, and introduces minimal computational overhead (20–50 ms per sample for a 7B model). Evaluation on four QA datasets and four attack types demonstrates the effectiveness of spectral entropy for backdoor detection. Code is available at https://github.com/dongdongzhaoUP/GradSentry.
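The abstract does not spell out how spectral entropy is computed, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes the "spectrum" of a per-sample gradient is the set of squared singular values of a 2-D gradient matrix, normalized into a probability distribution, and that the score is the Shannon entropy of that distribution scaled to [0, 1]. The function names `spectral_entropy` and `flag_suspicious`, and the fixed threshold, are hypothetical and chosen for illustration.

```python
import numpy as np


def spectral_entropy(grad: np.ndarray, eps: float = 1e-12) -> float:
    """Normalized Shannon entropy of the singular-value energy of a 2-D gradient.

    Assumption: the spectrum is the squared singular values of `grad`.
    Returns a value in [0, 1]; 0 means all energy sits in one direction,
    1 means energy is spread evenly over all singular directions.
    """
    s = np.linalg.svd(grad, compute_uv=False)  # singular values only
    if s.size < 2:
        return 0.0  # entropy is undefined for a single component
    energy = s ** 2
    p = energy / (energy.sum() + eps)          # normalize to a distribution
    h = -np.sum(p * np.log(p + eps))           # Shannon entropy in nats
    return float(min(1.0, max(0.0, h / np.log(p.size))))


def flag_suspicious(entropies, threshold: float):
    """Return indices whose entropy exceeds a (hypothetical) fixed threshold.

    Each sample is scored independently, so no clustering or pairwise
    comparison is involved.
    """
    return [i for i, h in enumerate(entropies) if h > threshold]


# Toy matrices that only exercise the function; they are not real gradients.
rng = np.random.default_rng(0)
concentrated = np.outer(rng.standard_normal(16), rng.standard_normal(16))
spread_out = rng.standard_normal((16, 16))
scores = [spectral_entropy(concentrated), spectral_entropy(spread_out)]
flagged = flag_suspicious(scores, threshold=0.5)
```

In practice the input would be the gradient of the loss on one training sample with respect to a chosen weight matrix, and the threshold would have to be calibrated. How GradSentry selects layers and sets its decision rule is not stated in the abstract.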