Towards Poisoning Robustness Certification for Natural Language Generation

📅 2026-02-10
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing poisoning robustness certification methods struggle to apply to autoregressive language generation because of sequential dependencies and exponentially large output spaces. This work proposes the first poisoning robustness certification framework tailored to natural language generation. It introduces two security properties, "stability" and "validity", and designs a Targeted Partition Aggregation (TPA) algorithm that integrates mixed-integer linear programming (MILP) to tighten the certificates. The approach provides the first provable defense against targeted poisoning attacks on generation validity and supports multi-turn generation. Experiments show that in tool-use tasks validity can be certified under up to 0.5% data poisoning, while in preference alignment tasks stability is guaranteed for sequences of up to eight tokens.

📝 Abstract
Understanding the reliability of natural language generation is critical for deploying foundation models in security-sensitive domains. While certified poisoning defenses provide provable robustness bounds for classification tasks, they are fundamentally ill-equipped for autoregressive generation: they cannot handle sequential predictions or the exponentially large output space of language models. To establish a framework for certified natural language generation, we formalize two security properties: stability (robustness to any change in generation) and validity (robustness to targeted, harmful changes in generation). We introduce Targeted Partition Aggregation (TPA), the first algorithm to certify validity against targeted attacks by computing the minimum poisoning budget needed to induce a specific harmful class, token, or phrase. Further, we extend TPA to provide tighter guarantees for multi-turn generations using mixed integer linear programming (MILP). Empirically, we demonstrate TPA's effectiveness across diverse settings, including certifying validity of agent tool-calling when adversaries modify up to 0.5% of the dataset and certifying 8-token stability horizons in preference-based alignment. Though inference-time latency remains an open challenge, our contributions enable certified deployment of language models in security-critical applications.
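The "minimum poisoning budget" idea behind TPA can be sketched with a standard partition-aggregation voting scheme (in the spirit of Deep Partition Aggregation): split the training set into disjoint partitions, train one model per partition, and aggregate their votes, so that each poisoned sample can flip at most one vote. The sketch below is a minimal illustration under those assumptions; the function name, vote model, and tie-breaking rule are illustrative, not the paper's exact TPA algorithm or its MILP tightening.

```python
# Hypothetical sketch of a partition-aggregation-style targeted certificate.
# Assumption: disjoint training partitions, one model each, so one poisoned
# sample flips at most one partition's vote (DPA-style threat model).
from collections import Counter


def targeted_certified_budget(votes, target):
    """Lower-bound the number of poisoned training samples an adversary
    needs to make `target` the aggregate prediction.

    votes  -- mapping from candidate output to number of partitions voting for it
    target -- the adversary's desired (harmful) output
    """
    counts = Counter(votes)
    top, top_votes = counts.most_common(1)[0]
    if top == target:
        return 0  # target already wins; no poisoning required
    gap = top_votes - counts.get(target, 0)
    # Best case for the adversary: each flip moves one vote from `top` to
    # `target`, closing the gap by 2 per poisoned sample. Ties are assumed
    # to resolve in favor of the clean output, so the adversary needs a
    # strict majority for `target`.
    return gap // 2 + 1


# Example: 10 partitions vote "safe", 4 vote "harmful".
budget = targeted_certified_budget({"safe": 10, "harmful": 4}, "harmful")
print(budget)  # 4 poisoned samples are provably required
```

Certifying a whole generated phrase then amounts to lower-bounding the budget across the sequence of token-level votes, which is where the paper's MILP formulation tightens the naive per-token bound.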
Problem

Research questions and friction points this paper is trying to address.

poisoning robustness
natural language generation
certified defenses
autoregressive models
security certification
Innovation

Methods, ideas, or system contributions that make the work stand out.

certified robustness
natural language generation
poisoning defense
Targeted Partition Aggregation
mixed integer linear programming