🤖 AI Summary
This work addresses the unclear roles and bottlenecks of various design choices in current reinforcement-based fine-tuning of large language models, where empirical findings often conflict. It proposes the first systematic framework that formulates reinforcement fine-tuning as a multi-armed bandit problem with an extremely large discrete action space. Starting from a minimal configuration, the framework employs hierarchical ablation studies and direct reward learning without advantage estimation to incrementally evaluate the impact of key components. Experiments across three large language models and two reasoning benchmarks reveal the true contributions of individual elements, clarify several common misconceptions, and offer both novel insights into the underlying mechanisms and practical guidance for effective reinforcement fine-tuning.
📝 Abstract
A large number of heuristics have been proposed to optimize the reinforcement fine-tuning of LLMs. However, inconsistent claims appear from time to time, making this area elusive. Reflecting on this situation, two fundamental questions still lack a clear answer: 1) what is the role of each design choice? 2) which ones are the bottlenecks? This paper aims to shed light on both, and it faces the challenge that several confounding factors are entangled in the fine-tuning process. To tackle this challenge, we propose a bottom-up experiment pipeline. The bottom layer is a minimalist configuration: a single training example, one rollout per round, and the reward serving directly as the learning signal without any advantage-function design. This minimalist configuration corresponds to multi-armed bandit learning with an extremely large discrete action space, which offers theory to corroborate the experimental findings. The upward procedure of the pipeline then expands the minimalist configuration layer by layer, examining the role of each design choice. Experimental results on three LLMs and two reasoning datasets not only reveal new understanding of the design choices but also yield essential insights to shape the area.
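The minimalist configuration described in the abstract (one rollout per round, raw reward used directly as the learning signal, no advantage function) reduces to REINFORCE on a multi-armed bandit. Below is a toy sketch of that reduction, not the paper's actual code: the arm count, reward table, learning rate, and function names are all illustrative assumptions, and a real LLM setting would have a vastly larger discrete action space (one arm per possible response).

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_bandit(num_arms=5, rewards=None, steps=2000, lr=0.1, seed=0):
    """REINFORCE on a multi-armed bandit: one action ("rollout") sampled
    per round, and the raw reward scales the gradient directly --
    no baseline or advantage estimate is subtracted."""
    rng = random.Random(seed)
    if rewards is None:
        # Hypothetical reward table: arm i pays 0.1 * i, so arm 4 is best.
        rewards = [0.1 * a for a in range(num_arms)]
    logits = [0.0] * num_arms
    for _ in range(steps):
        probs = softmax(logits)
        a = rng.choices(range(num_arms), weights=probs)[0]  # one rollout
        r = rewards[a]                                      # raw reward signal
        # Policy-gradient step: grad log pi(a) = one_hot(a) - probs.
        for i in range(num_arms):
            grad = (1.0 if i == a else 0.0) - probs[i]
            logits[i] += lr * r * grad
    return softmax(logits)

final_probs = train_bandit()
```

After training, the policy should concentrate probability on high-reward arms; because every reward here is non-negative and no baseline is subtracted, any sampled rewarded arm is reinforced, which is exactly the variance issue that advantage functions in the expanded configurations are meant to address.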