🤖 AI Summary
This work addresses the sensitivity of large language models to minor prompt perturbations in multi-step reasoning tasks and their poor robustness under distribution shift. The authors propose the first token-level distributionally robust optimization (DRO) framework integrated with reinforcement learning from human feedback (RLHF). By constructing ambiguity sets via f-divergence and introducing a novel span-level actor loss, the method adaptively focuses on challenging response segments to enhance consistency. Evaluated on the GSM8K and MathQA mathematical reasoning benchmarks, the approach achieves improvements of 9.17% and 2.49%, respectively, improving consistency under distribution shift.
📝 Abstract
Large Language Models (LLMs) tend to respond correctly to prompts that align with the data they were trained and fine-tuned on. Yet small shifts in wording, format, or language can trigger surprisingly large failures, especially on multi-step reasoning problems. To address this problem, we propose Distributionally Robust Token Optimization (DRTO), an approach that combines token-level Reinforcement Learning from Human Feedback (RLHF) with Distributionally Robust Optimization (DRO). DRTO bounds worst-case token-wise rewards by constructing an f-divergence ambiguity set over a loss minibatch, which yields a theoretical robustness guarantee. Empirically, DRTO enhances consistency under distribution shifts on mathematical reasoning benchmarks, achieving a 9.17% improvement on GSM8K and a 2.49% improvement on MathQA.
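To illustrate what an f-divergence ambiguity set over a loss minibatch can look like, the following is a minimal sketch and not the authors' implementation. It assumes the KL divergence as the f-divergence and uses the penalized (rather than constrained) form of DRO, for which the worst-case objective has the closed form `tau * log(mean(exp(loss / tau)))`. The function names, the temperature `tau`, and the example losses are all hypothetical.

```python
import math


def kl_dro_loss(losses, tau=1.0):
    """KL-penalized worst-case mean of `losses` (illustrative assumption).

    Equals max over reweightings q of  sum(q * loss) - tau * KL(q || uniform),
    which has the closed form  tau * log(mean(exp(loss / tau))).
    """
    if tau <= 0:
        raise ValueError("tau must be positive")
    m = max(losses)
    # Subtract the max before exponentiating for numerical stability.
    log_sum = m / tau + math.log(sum(math.exp((l - m) / tau) for l in losses))
    return tau * (log_sum - math.log(len(losses)))


def kl_dro_weights(losses, tau=1.0):
    """Worst-case reweighting: a softmax over losses with temperature tau."""
    if tau <= 0:
        raise ValueError("tau must be positive")
    m = max(losses)
    unnorm = [math.exp((l - m) / tau) for l in losses]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Hypothetical per-token losses for one minibatch; the third token is "hard".
token_losses = [0.2, 0.1, 2.5, 0.3]
robust = kl_dro_loss(token_losses, tau=0.5)
weights = kl_dro_weights(token_losses, tau=0.5)
plain_mean = sum(token_losses) / len(token_losses)
```

In this sketch the robust loss always lies between the plain mean and the maximum loss, and the weights concentrate on the highest-loss tokens as `tau` shrinks. Other f-divergences (for example chi-squared) lead to different reweightings; the paper's actual choice is not specified in the abstract.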