Beyond Semantic Manipulation: Token-Space Attacks on Reward Models

📅 2026-04-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Reward models are known to be vulnerable to semantic adversarial attacks, but their fragility at the non-semantic level remains underexplored. This work proposes the Token Mapping Perturbation Attack (TOMPA) framework, which for the first time performs black-box adversarial optimization of reward models directly in the original token space. By bypassing the decode–retokenize interface and relying solely on scalar feedback, TOMPA generates sequences that achieve high rewards despite being semantically meaningless. Experiments show that TOMPA nearly doubles the average reward on Skywork-Reward-V2-Llama-3.1-8B, with 98.0% of generated samples surpassing the reward of GPT-5 reference responses, exposing a critical and systematic non-semantic vulnerability in current reward models.
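
The decode–retokenize bypass can be pictured with a short sketch. The snippet below is a hypothetical illustration, not the authors' code: it assumes the reward model is packaged as a Hugging Face sequence-classification model with a scalar head (the model id is the one named above), and the helper names `reward_via_text` and `reward_via_tokens` are made up here purely for contrast.

```python
# Hypothetical sketch (not the paper's code) of the interface TOMPA bypasses.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Model id taken from the paper; assuming it loads as a scalar-head classifier.
RM_NAME = "Skywork/Skywork-Reward-V2-Llama-3.1-8B"
tok = AutoTokenizer.from_pretrained(RM_NAME)
rm = AutoModelForSequenceClassification.from_pretrained(RM_NAME, num_labels=1)
rm.eval()

def reward_via_text(prompt: str, response_text: str) -> float:
    """Standard path: decode the policy output to text, let the RM re-tokenize it."""
    msgs = [{"role": "user", "content": prompt},
            {"role": "assistant", "content": response_text}]
    ids = tok.apply_chat_template(msgs, tokenize=True, return_tensors="pt")
    with torch.no_grad():
        return rm(ids).logits.squeeze().item()

def reward_via_tokens(prompt_ids: torch.Tensor, response_ids: torch.Tensor) -> float:
    """Token-space path: feed raw token ids straight to the RM, skipping the
    decode -> re-tokenize round trip, so sequences no tokenizer would ever
    produce can still be scored."""
    ids = torch.cat([prompt_ids, response_ids], dim=-1).unsqueeze(0)
    with torch.no_grad():
        return rm(ids).logits.squeeze().item()
```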
📝 Abstract
Reward models (RMs) are widely used as optimization targets in reinforcement learning from human feedback (RLHF), yet they remain vulnerable to reward hacking. Existing attacks mainly operate within the semantic space, constructing human-readable adversarial outputs that exploit RM biases. In this work, we introduce a fundamentally different paradigm: Token Mapping Perturbation Attack (TOMPA), a framework that performs adversarial optimization directly in token space. By bypassing the standard decode-re-tokenize interface between the policy and the reward model, TOMPA enables the attack policy to optimize over raw token sequences rather than coherent natural language. Using only black-box scalar feedback, TOMPA automatically discovers non-linguistic token patterns that elicit extremely high rewards across multiple state-of-the-art RMs. Specifically, when targeting Skywork-Reward-V2-Llama-3.1-8B, TOMPA nearly doubles the reward of GPT-5 reference answers and outperforms them on 98.0% of prompts. Despite these high scores, the generated outputs degenerate into nonsensical text, revealing that RMs can be systematically exploited beyond the semantic regime and exposing a critical vulnerability in current RLHF pipelines.
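
For intuition only, here is a minimal hill-climbing sketch of what black-box optimization over raw token ids with scalar feedback can look like. This is not the TOMPA algorithm itself (the abstract does not specify the search procedure); `score_fn` stands for any black-box scorer, such as the hypothetical `reward_via_tokens` above, and all parameters are illustrative.

```python
# Minimal hill-climbing sketch of black-box token-space search with scalar feedback.
# NOT the TOMPA algorithm; the abstract does not give its optimization details.
import random
import torch

def hill_climb_tokens(prompt_ids: torch.Tensor,
                      score_fn,              # black-box scorer, e.g. reward_via_tokens above
                      vocab_size: int,
                      seq_len: int = 64,
                      steps: int = 2000,
                      n_mutations: int = 2) -> torch.Tensor:
    # Start from a random, non-linguistic token sequence.
    best = torch.randint(0, vocab_size, (seq_len,))
    best_score = score_fn(prompt_ids, best)
    for _ in range(steps):
        cand = best.clone()
        # Mutate a few positions to random vocabulary ids.
        for pos in random.sample(range(seq_len), n_mutations):
            cand[pos] = random.randrange(vocab_size)
        score = score_fn(prompt_ids, cand)
        if score > best_score:  # keep the mutation only if the scalar reward improves
            best, best_score = cand, score
    return best
```

Because such a search never decodes and re-tokenizes, the candidates it keeps need not correspond to any natural-language string, which is exactly the non-semantic regime the paper probes.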
Problem

Research questions and friction points this paper is trying to address.

reward hacking
reward models
adversarial attacks
token space
RLHF
Innovation

Methods, ideas, or system contributions that make the work stand out.

token-space attack
reward model vulnerability
adversarial optimization
RLHF
non-linguistic patterns