🤖 AI Summary
To address inefficient task selection, inaccurate difficulty estimation, high rollout cost, and poor adaptivity in reinforcement finetuning (RFT) of large language models (LLMs), this paper proposes BOTS, a Bayesian online task selection framework. BOTS integrates Bayesian online learning with Thompson sampling for dynamic task selection, jointly leveraging explicit evidence from direct evaluations of selected tasks and implicit evidence inferred, via a lightweight interpolation plug-in, for unselected tasks without extra rollouts. This yields adaptive, real-time task difficulty estimation and scheduling at negligible overhead. Empirically, BOTS improves data efficiency and consistently outperforms task selection baselines across diverse domains and LLM scales, achieving better final performance and faster convergence.
📝 Abstract
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
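The mechanism the abstract describes, per-task posteriors over difficulty, explicit updates from evaluated tasks, soft implicit updates for unevaluated ones, and Thompson sampling for selection, can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the class name `BotsSketch`, the `target` success rate, and the pseudo-count implicit update standing in for the paper's interpolation plug-in are all assumptions for illustration.

```python
import random

class BotsSketch:
    """Illustrative sketch of Bayesian online task selection via Thompson sampling.

    Each task keeps a Beta(alpha, beta) posterior over its success rate
    (a proxy for difficulty). Selected tasks receive explicit updates from
    rollout outcomes; unselected tasks receive small discounted pseudo-counts
    drawn from the batch average (an assumed stand-in for the paper's
    interpolation-based implicit-evidence plug-in).
    """

    def __init__(self, n_tasks, target=0.5, implicit_weight=0.1, seed=0):
        self.alpha = [1.0] * n_tasks   # Beta prior: one pseudo-success
        self.beta = [1.0] * n_tasks    # Beta prior: one pseudo-failure
        self.target = target           # preferred success rate (intermediate difficulty)
        self.w = implicit_weight       # strength of the implicit pseudo-update
        self.rng = random.Random(seed)

    def select(self, k):
        # Thompson sampling: draw one success rate per task from its posterior,
        # then pick the k tasks whose draws land closest to the target difficulty.
        draws = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        ranked = sorted(range(len(draws)), key=lambda i: abs(draws[i] - self.target))
        return ranked[:k]

    def update(self, selected, successes, trials):
        # Explicit evidence: Bernoulli counts from rollouts on selected tasks.
        for i, s, n in zip(selected, successes, trials):
            self.alpha[i] += s
            self.beta[i] += n - s
        # Implicit evidence: nudge unevaluated tasks toward the observed batch
        # rate with a small pseudo-count, so their posteriors track model drift.
        if sum(trials):
            batch_rate = sum(successes) / sum(trials)
            chosen = set(selected)
            for i in range(len(self.alpha)):
                if i not in chosen:
                    self.alpha[i] += self.w * batch_rate
                    self.beta[i] += self.w * (1.0 - batch_rate)
```

In use, training alternates `select` (choose a batch of tasks) with `update` (fold rollout outcomes back into the posteriors); because the implicit step adds only scalar arithmetic per task, it costs no extra rollouts.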