Distributionally Robust Token Optimization in RLHF

📅 2026-03-27

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the sensitivity of large language models to small prompt perturbations, such as changes in wording, format, or language, which can cause large failures on multi-step reasoning tasks. The authors propose Distributionally Robust Token Optimization (DRTO), which combines token-level reinforcement learning from human feedback (RLHF) with distributionally robust optimization (DRO). By constructing an f-divergence ambiguity set over a minibatch of losses, the method bounds worst-case token-wise rewards. On mathematical reasoning benchmarks under distribution shift, it reports improvements of 9.17% on GSM8K and 2.49% on MathQA.

📝 Abstract

Large Language Models (LLMs) tend to respond correctly to prompts that align with the data they were trained and fine-tuned on. Yet small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose a Distributionally Robust Token Optimization (DRTO) approach, which combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO bounds worst-case token-wise rewards by constructing an f-divergence ambiguity set over a minibatch of losses, which yields a theoretical robustness guarantee. Empirically, DRTO enhances consistency under distribution shifts in mathematical reasoning benchmarks, achieving a 9.17% improvement on GSM8K and a 2.49% improvement on MathQA.
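To make the idea of an ambiguity set over a minibatch of losses concrete, here is a minimal sketch of one well-known special case. It is an illustration under stated assumptions, not the paper's implementation. It assumes the f-divergence is the KL divergence and uses the penalized form of the problem, where the worst-case weights over a uniform reference are proportional to exp(loss / tau). The function name `kl_dro_reweight` and the temperature parameter `tau` are hypothetical and do not come from the abstract.

```python
import numpy as np


def kl_dro_reweight(losses, tau=1.0):
    """Worst-case reweighting of per-token losses under a KL penalty.

    Assumption (not from the paper): the adversary solves
        max_q  sum_i q_i * loss_i - tau * KL(q || uniform),
    whose maximizer is q_i proportional to exp(loss_i / tau).
    Returns the reweighted loss and the weights q.
    """
    losses = np.asarray(losses, dtype=float)
    scaled = losses / tau
    scaled = scaled - scaled.max()      # subtract the max for numerical stability
    weights = np.exp(scaled)
    weights = weights / weights.sum()   # normalize to a probability vector
    robust_loss = float(np.dot(weights, losses))
    return robust_loss, weights


# Hypothetical per-token losses for one minibatch; the last token is "hard".
token_losses = [0.1, 0.2, 3.0]
robust_loss, weights = kl_dro_reweight(token_losses, tau=1.0)
mean_loss = float(np.mean(token_losses))
```

In this sketch a small `tau` concentrates the weight on the highest-loss tokens, while a large `tau` recovers the ordinary minibatch average, so the reweighted loss is never below the plain mean.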

Problem

Research questions and friction points this paper is trying to address.

distributional robustness

large language models

prompt sensitivity

reasoning robustness

distribution shift

Innovation

Methods, ideas, or system contributions that make the work stand out.

Distributionally Robust Optimization

Token-level RLHF

f-divergence ambiguity sets