Distributionally Robust Token Optimization in RLHF

📅 2026-03-27
📈 Citations: 0
Influential: 0
📄 PDF

career value

178K/year
🤖 AI Summary
This work addresses the sensitivity of large language models to minor prompt perturbations in multi-step reasoning tasks and their poor robustness under distributional shift. The authors propose the first token-level distributionally robust optimization (DRO) framework integrated with reinforcement learning from human feedback (RLHF). By constructing ambiguity sets via f-divergence and introducing a novel span-level actor loss, the method adaptively focuses on challenging response segments to enhance consistency. Evaluated on the MATH-500 and LiveCodeBench benchmarks, the approach achieves absolute improvements of 4.4 and 2.7 percentage points, respectively, significantly boosting model stability and reasoning reliability under distributional shift.
📝 Abstract
Large Language Models (LLMs) tend to respond correctly to prompts that align to the data they were trained and fine-tuned on. Yet, small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO bounds worst case token-wise rewards by constructing an f-divergence ambiguity set over a loss minibatch, leading to a theoretical robustness. Empirically, DRTO enhances consistency under distribution shifts in mathematical reasoning benchmarks, achieving 9.17\% improvement on GSM8K and 2.49% improvement on MathQA.
Problem

Research questions and friction points this paper is trying to address.

distributional robustness
large language models
prompt sensitivity
reasoning robustness
distribution shift
Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributionally Robust Optimization
Token-level RLHF
f-divergence ambiguity sets
Span-level actor loss
Robustness to distribution shift
🔎 Similar Papers