🤖 AI Summary
Existing research on adversarial robustness often focuses on specific models or tasks, making it difficult to uncover common vulnerabilities across text scoring models such as dense retrievers, rerankers, and reward models. This work presents the first unified framework for adversarial robustness in text scoring, introducing a tailored attack method that transfers across model roles and a composite adversarial training strategy. By leveraging the verifiability inherent in scoring tasks, the proposed approach substantially enhances models' robustness against diverse attacks and improves downstream task performance. Furthermore, it effectively mitigates reward hacking in reinforcement learning from human feedback (RLHF), thereby facilitating the training of large language models that are better aligned with human intent.
📝 Abstract
Research on adversarial robustness in language models is currently fragmented across applications and attacks, obscuring shared vulnerabilities. In this work, we propose unifying the study of adversarial robustness in text scoring models, spanning dense retrievers, rerankers, and reward models. This motivates adapting both attacks and adversarial training methods across model roles. Unlike open-ended generation, text scoring failures are directly testable: an attack succeeds when an irrelevant or rejected text outscores a relevant or chosen one. Through this principled lens of text scoring, we demonstrate that current adversarial training formulations for language models are often short-sighted, failing to generalize across attacks. To address this, we introduce multiple adversarial training methods for text scoring models and show that combining complementary training methods can yield strong robustness while also improving task effectiveness. We also highlight the practical value of our approach for RLHF, showing that our adversarially trained reward models mitigate reward hacking and support the training of better-aligned LLMs. We provide our code and models for further study.
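The abstract's attack-success criterion is directly checkable in code. As a minimal sketch, the snippet below uses a toy bag-of-words scorer standing in for any real retriever, reranker, or reward model (all names and the keyword-stuffing example are illustrative assumptions, not the paper's method): an attack succeeds precisely when the adversarial text outscores the legitimately relevant one.

```python
# Hedged sketch of the attack-success criterion for text scoring models.
# `toy_score` is a stand-in for any scorer (retriever, reranker, reward
# model); the keyword-stuffed adversarial text is an illustrative attack.

def toy_score(query: str, text: str) -> float:
    """Toy relevance score: fraction of query words appearing in text."""
    q_words = set(query.lower().split())
    t_words = set(text.lower().split())
    return len(q_words & t_words) / max(len(q_words), 1)

def attack_succeeds(query: str, relevant: str, adversarial: str) -> bool:
    """An attack succeeds when the irrelevant/rejected text outscores
    the relevant/chosen one under the scorer."""
    return toy_score(query, adversarial) > toy_score(query, relevant)

query = "capital of france"
relevant = "paris is france's capital city"
# Keyword stuffing fools the naive lexical scorer:
adversarial = "capital of france capital of france buy cheap watches"

print(attack_succeeds(query, relevant, adversarial))
```

A semantically better scorer would resist this particular stuffing attack, which is exactly the kind of gap adversarial training for text scoring aims to close.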