SCOUT: A Defense Against Data Poisoning Attacks in Fine-Tuned Language Models

📅 2025-12-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Fine-tuning language models is vulnerable to contextually plausible, semantically stealthy backdoor attacks—e.g., disguised triggers in medical diagnosis or social media addiction classification—against which existing defenses exhibit limited detection capability. To address this, we propose SCOUT, the first general-purpose backdoor detection framework that requires no predefined contextual rules and operates at the token level via logit sensitivity analysis. SCOUT quantifies the impact of individual token removal on target-label logits using gradient- or perturbation-driven saliency mapping, enabling fine-grained token importance ranking and threshold-based detection. We introduce three novel context-aware attack variants—ViralApp, Fever, and Referral—and evaluate SCOUT across SST-2, IMDB, and AG News. It achieves >94% detection accuracy against both conventional (BadNet) and new attacks, with <0.8% degradation in clean-task accuracy—substantially outperforming state-of-the-art defenses.

📝 Abstract
Backdoor attacks pose significant security threats to language models by embedding hidden triggers that manipulate model behavior during inference, presenting critical risks for AI systems deployed in healthcare and other sensitive domains. While existing defenses effectively counter obvious threats such as out-of-context trigger words and safety alignment violations, they fail against sophisticated attacks using contextually appropriate triggers that blend seamlessly into natural language. This paper introduces three novel contextually aware attack scenarios that exploit domain-specific knowledge and semantic plausibility: the ViralApp attack targeting social media addiction classification, the Fever attack manipulating medical diagnosis toward hypertension, and the Referral attack steering clinical recommendations. These attacks represent realistic threats where malicious actors exploit domain-specific vocabulary while maintaining semantic coherence, demonstrating how adversaries can weaponize contextual appropriateness to evade conventional detection methods. To counter both traditional and these sophisticated attacks, we present SCOUT (Saliency-based Classification Of Untrusted Tokens), a novel defense framework that identifies backdoor triggers through token-level saliency analysis rather than traditional context-based detection methods. SCOUT constructs a saliency map by measuring how the removal of individual tokens affects the model's output logits for the target label, enabling detection of both conspicuous and subtle manipulation attempts. We evaluate SCOUT on established benchmark datasets (SST-2, IMDB, AG News) against conventional attacks (BadNet, AddSent, SynBkd, StyleBkd) and our novel attacks, demonstrating that SCOUT successfully detects these sophisticated threats while preserving accuracy on clean inputs.
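The leave-one-out saliency scoring described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: `model_logit` is a hypothetical stand-in for the fine-tuned model's logit on the attacker's target label (here a toy bag-of-words scorer so the example is self-contained), and the detection threshold is an assumed placeholder.

```python
# Toy stand-in for a fine-tuned classifier's target-label logit.
# In SCOUT this would come from the actual language model; the weight
# table below is a fabricated example for illustration only.
TOKEN_WEIGHT = {"fever": 4.0}  # a contextually plausible trigger token

def model_logit(tokens):
    """Hypothetical target-label logit: sum of per-token weights."""
    return sum(TOKEN_WEIGHT.get(t, 0.1) for t in tokens)

def saliency_map(tokens):
    """Score each token by the logit drop its removal causes."""
    base = model_logit(tokens)
    return {
        i: base - model_logit(tokens[:i] + tokens[i + 1:])
        for i in range(len(tokens))
    }

def flag_triggers(tokens, threshold=1.0):
    """Threshold the saliency scores to flag candidate trigger tokens."""
    return [tokens[i] for i, s in saliency_map(tokens).items() if s > threshold]

tokens = "patient reports mild fever and fatigue".split()
print(flag_triggers(tokens))  # the high-impact token stands out
```

The key property exploited here is that a backdoor trigger, however contextually plausible, must move the target-label logit sharply, so its removal produces an outlier drop relative to benign tokens.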
Problem

Research questions and friction points this paper is trying to address.

Defends against data poisoning with contextually-appropriate triggers
Detects backdoor attacks in fine-tuned language models
Identifies subtle manipulation via token-level saliency analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Saliency-based token analysis for trigger detection
Detection of both conspicuous and subtle manipulation attempts
Framework effective against traditional and contextually-aware attacks
Mohamed Afane
Stanford University
Legal NLP · Machine Learning · AI for Public Health
Abhishek Satyam
Department of Computer and Information Sciences, Fordham University, New York, NY, USA
Ke Chen
Department of Electrical Engineering, Zhejiang University, Hangzhou, China
Tao Li
Department of Systems Engineering, City University of Hong Kong, Hong Kong SAR, China
Junaid Farooq
University of Michigan-Dearborn
NextG Networks · Cyber-Physical Systems · Cyber Security · Resilience
Juntao Chen
Department of Computer and Information Sciences, Fordham University
Cyber-Physical Systems · Cyber Security and Resilience · Game and Decision Theory · Optimization and Learning · Smart Grids