Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

📅 2026-04-03
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the sparsity and ambiguity inherent in response-level rewards within rubric-based reinforcement learning by proposing the Rubrics to Tokens (RTT) framework, which establishes the first bridge from response-level scoring to token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to identify tokens most pertinent to the rubric criteria and integrates both response-level and token-level advantage estimation to optimize the policy model. Furthermore, it presents the RTT-GRPO algorithm alongside an intra-sample token-group normalization technique tailored for the three-dimensional token-level reward space. Experimental results demonstrate that RTT significantly improves instruction-following accuracy and alignment with rubric criteria across multiple models, consistently outperforming existing baselines.
📝 Abstract
Rubric-based Reinforcement Learning (RL) has emerged as a promising approach for aligning Large Language Models (LLMs) with complex, open-domain instruction-following tasks. However, existing methods predominantly rely on response-level rewards, introducing severe reward sparsity and reward ambiguity problems. To address these issues, we propose Rubrics to Tokens (RTT), a novel rubric-based RL framework that bridges coarse response-level scores and fine-grained token-level credit assignment. RTT introduces a Token-Level Relevance Discriminator to predict which tokens in the response are responsible for a specific constraint, and optimizes the policy model via RTT-GRPO, which integrates response-level and token-level advantages within a unified framework. Furthermore, to handle the transition from a one-dimensional, outcome-level reward to a three-dimensional reward space in token-level rubric-based RL, we propose a novel group normalization method, called Intra-sample Token Group Normalization, to accommodate this shift. Extensive experiments and benchmarks demonstrate that RTT consistently outperforms other baselines in both instruction- and rubric-level accuracy across different models.
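The abstract names three moving parts: a relevance mask over tokens per rubric item, an intra-sample token-group normalization over the resulting (response, rubric, token) reward space, and a combined response-level plus token-level advantage used by RTT-GRPO. The sketch below is a minimal, hedged illustration of how those pieces could fit together; the tensor shapes, the signed per-rubric credit, the token-group definition, and the mixing weight `lam_tok` are all assumptions made for illustration, not the paper's actual implementation.

```python
# Hedged sketch (not the authors' released code): one plausible reading of how a
# response-level GRPO advantage could be combined with token-level credit derived
# from rubric scores and a relevance mask. Shapes, `lam_tok`, and the
# normalization details are assumptions.
import numpy as np

def grpo_response_advantage(rewards, eps=1e-6):
    """Standard GRPO-style normalization over the G responses sampled for one prompt.

    rewards: (G,) scalar reward per response (here: mean rubric score).
    returns: (G,) response-level advantages.
    """
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def intra_sample_token_advantage(rubric_scores, relevance_mask, eps=1e-6):
    """Assumed form of intra-sample token-group normalization.

    rubric_scores:  (G, K) in [0, 1], one score per rubric item per response.
    relevance_mask: (G, K, T) 1 where the discriminator marks token t as relevant
                    to rubric k in response g, else 0.
    returns:        (G, T) token-level advantages, normalized within each sample's
                    group of relevant tokens rather than across the response group.
    """
    # Signed per-rubric credit (assumption): satisfied rubrics push their relevant
    # tokens up, violated rubrics push them down.
    signed = 2.0 * rubric_scores - 1.0                               # (G, K)
    token_reward = np.einsum('gk,gkt->gt', signed, relevance_mask)   # (G, T)

    G, T = token_reward.shape
    token_adv = np.zeros((G, T))
    for g in range(G):
        group = relevance_mask[g].any(axis=0)   # tokens relevant to >= 1 rubric
        if group.sum() < 2:
            continue
        r = token_reward[g, group]
        # Intra-sample normalization: statistics come from this response only.
        token_adv[g, group] = (r - r.mean()) / (r.std() + eps)
    return token_adv

def rtt_grpo_advantage(rubric_scores, relevance_mask, lam_tok=0.5):
    """Combine response-level and token-level advantages (mixing weight is assumed)."""
    resp_adv = grpo_response_advantage(rubric_scores.mean(axis=1))   # (G,)
    tok_adv = intra_sample_token_advantage(rubric_scores, relevance_mask)
    return resp_adv[:, None] + lam_tok * tok_adv                     # (G, T)

# Toy example: 4 sampled responses, 2 rubric items, 6 tokens each.
rng = np.random.default_rng(0)
scores = rng.uniform(0.0, 1.0, size=(4, 2))
mask = (rng.uniform(size=(4, 2, 6)) > 0.5).astype(float)
print(rtt_grpo_advantage(scores, mask).shape)  # (4, 6)
```

Under this reading, tokens flagged for satisfied rubrics receive positive credit and tokens flagged for violated rubrics receive negative credit, while unflagged tokens fall back to the response-level advantage alone; how the paper actually defines the token group and the signed credit may differ.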
Problem

Research questions and friction points this paper is trying to address.

reward sparsity
reward ambiguity
instruction following
rubric-based reinforcement learning
token-level credit assignment
Innovation

Methods, ideas, or system contributions that make the work stand out.

Rubric-based Reinforcement Learning
Token-level Reward
Credit Assignment
Group Normalization
Instruction Following
Tianze Xu
Shanghai Jiao Tong University
Yanzhao Zheng
Alibaba Group
Pengrui Lu
Shanghai Jiao Tong University
Lyumanshan Ye
Shanghai Jiao Tong University
Human-Computer Interaction
Yong Wu
Zhejiang University
Zhentao Zhang
Alibaba Group
Yuanqiang Yu
Alibaba Group
Chao Ma
Alibaba Group
Jihuai Zhu
Alibaba Group
Pengfei Liu
Associate professor at Shanghai Jiao Tong University
LLM
Baohua Dong
Alibaba Group
Hangcheng Zhu
Alibaba Group
Ruohui Huang
Alibaba Group
Gang Yu
Alibaba Group