WARP: Guaranteed Inner-Layer Repair of NLP Transformers

📅 2026-04-01
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing Transformer repair methods struggle to balance verifiability against the size of the parameter search space: approaches with formal guarantees are typically restricted to the final layer or to small networks, and thus fail to defend effectively against adversarial perturbations. This work proposes WARP, a framework that, for the first time, extends theoretically guaranteed repair to multiple internal layers of Transformers. By formulating a convex quadratic programming objective based on a first-order linearization, and by integrating Lipschitz continuity analysis with sensitivity-driven preprocessing, WARP provides three sample-level guarantees: correct classification on repaired inputs, preservation of the remain set, and a certified robustness radius. Experiments demonstrate that WARP satisfies the repair constraints across diverse encoder architectures, significantly improving adversarial robustness while preserving the model's generalization performance.
📝 Abstract
Transformer-based NLP models remain vulnerable to adversarial perturbations, yet existing repair methods face a fundamental trade-off: gradient-based approaches offer flexibility but lack verifiability and often overfit; methods that do provide repair guarantees are restricted to the final layer or small networks, significantly limiting the parameter search space available for repair. We present WARP (Weight-Adjusted Repair with Provability), a constraint-based repair framework that extends repair beyond the last layer of Transformer models. WARP formulates repair as a convex quadratic program derived from a first-order linearization of the logit gap, enabling tractable optimization over a high-dimensional parameter space. Under the condition that the first-order approximation holds, this formulation induces three per-sample guarantees: (i) a positive margin constraint ensuring correct classification on repaired inputs, (ii) preservation constraints over a designated remain set, and (iii) a certified robustness radius derived from Lipschitz continuity. To ensure feasibility across varying model architectures, we introduce a sensitivity-based preprocessing step that conditions the optimization landscape. We further show that the iterative optimization procedure converges to solutions satisfying all repair constraints under mild assumptions. Empirical evaluation on encoder-only Transformers with varying layer architectures validates that these guarantees hold in practice while improving robustness to adversarial inputs. Our results demonstrate that guaranteed, generalizable Transformer repair is achievable through principled constraint-based optimization.
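The abstract's core mechanism can be sketched as a minimum-norm weight update that restores a positive margin on the linearized logit gap, together with a certified radius derived from a Lipschitz bound. The sketch below is an illustrative single-sample, single-constraint special case with a closed-form solution (the paper's full convex QP jointly handles many repair samples and remain-set preservation constraints); the names `repair_step` and `certified_radius` are hypothetical, not taken from the paper.

```python
import numpy as np

def repair_step(theta, gap, grad, eps=0.1):
    """Minimal-norm update dtheta so the first-order linearization of the
    logit gap meets a positive margin:  gap + grad . dtheta >= eps.
    With a single linear constraint, the QP argmin ||dtheta||^2 reduces to
    a closed-form projection onto the feasible halfspace."""
    if gap >= eps:
        return theta  # margin constraint already satisfied; no repair needed
    dtheta = ((eps - gap) / (grad @ grad)) * grad
    return theta + dtheta

def certified_radius(margin, lipschitz):
    """Robustness radius implied by a Lipschitz bound L on the logit gap
    w.r.t. the input: perturbations smaller than margin / L cannot make
    the gap negative, so the predicted class cannot flip."""
    return max(margin, 0.0) / lipschitz
```

As a usage example: for a misclassified sample with linearized gap `-0.5` and gradient `[1, 2, 2]`, `repair_step` moves the weights just far enough along the gradient that the linearized gap becomes `eps = 0.1`, and `certified_radius(0.1, 2.0)` then certifies perturbations up to `0.05`.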
Problem

Research questions and friction points this paper is trying to address.

adversarial robustness
model repair
Transformer models
verifiable guarantees
NLP
Innovation

Methods, ideas, or system contributions that make the work stand out.

guaranteed repair
constraint-based optimization
Transformer robustness
convex quadratic programming
adversarial robustness
Hsin-Ling Hsu
National Chengchi University
Information Retrieval · Natural Language Processing · AI for Healthcare · Trustworthy AI
Min-Yu Chen
Department of Management Information Systems, National Chengchi University
Nai-Chia Chen
Department of Management Information Systems, National Chengchi University
Yan-Ru Chen
Department of Management Information Systems, National Chengchi University
Yi-Ling Chang
Department of Management Information Systems, National Chengchi University
Fang Yu
Associate Professor, Dept. Management Information Systems, National Chengchi University
Software Verification · String Analysis · Automata Theory · Web Security