🤖 AI Summary
To address inefficient task selection, inaccurate difficulty estimation, high rollout cost, and poor adaptivity in reinforcement finetuning (RFT) of large language models (LLMs), this paper proposes BOTS, a Bayesian online task selection framework. BOTS integrates Bayesian online learning with Thompson sampling for dynamic task selection, jointly leveraging explicit evidence from direct evaluations of selected tasks and implicit evidence inferred, via a lightweight interpolation plug-in, for unselected tasks without extra rollouts. This yields adaptive, real-time task difficulty estimation and scheduling at negligible overhead. Empirically, BOTS improves data efficiency and consistently outperforms task selection baselines across diverse domains and LLM scales, achieving better final performance and faster convergence.
📝 Abstract
Reinforcement finetuning (RFT) is a key technique for aligning Large Language Models (LLMs) with human preferences and enhancing reasoning, yet its effectiveness is highly sensitive to which tasks are explored during training. Uniform task sampling is inefficient, wasting computation on tasks that are either trivial or unsolvable, while existing task selection methods often suffer from high rollout costs, poor adaptivity, or incomplete evidence. We introduce BOTS, a unified framework for Bayesian Online Task Selection in LLM reinforcement finetuning. Grounded in Bayesian inference, BOTS adaptively maintains posterior estimates of task difficulty as the model evolves. It jointly incorporates explicit evidence from direct evaluations of selected tasks and implicit evidence inferred from these evaluations for unselected tasks, with Thompson sampling ensuring a principled balance between exploration and exploitation. To make implicit evidence practical, we instantiate it with an ultra-light interpolation-based plug-in that estimates difficulties of unevaluated tasks without extra rollouts, adding negligible overhead. Empirically, across diverse domains and LLM scales, BOTS consistently improves data efficiency and performance over baselines and ablations, providing a practical and extensible solution for dynamic task selection in RFT.
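The mechanism the abstract describes, per-task posteriors over difficulty, explicit updates from evaluated tasks, soft implicit updates for unevaluated ones, and Thompson sampling for selection, can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: the class name `BotsSketch`, the `target` success rate, and the pseudo-count implicit update standing in for the paper's interpolation plug-in are all assumptions for illustration.

```python
import random

class BotsSketch:
    """Illustrative sketch of Bayesian online task selection via Thompson sampling.

    Each task keeps a Beta(alpha, beta) posterior over its success rate
    (a proxy for difficulty). Selected tasks receive explicit updates from
    rollout outcomes; unselected tasks receive small discounted pseudo-counts
    drawn from the batch average (an assumed stand-in for the paper's
    interpolation-based implicit-evidence plug-in).
    """

    def __init__(self, n_tasks, target=0.5, implicit_weight=0.1, seed=0):
        self.alpha = [1.0] * n_tasks   # Beta prior: one pseudo-success
        self.beta = [1.0] * n_tasks    # Beta prior: one pseudo-failure
        self.target = target           # preferred success rate (intermediate difficulty)
        self.w = implicit_weight       # strength of the implicit pseudo-update
        self.rng = random.Random(seed)

    def select(self, k):
        # Thompson sampling: draw one success rate per task from its posterior,
        # then pick the k tasks whose draws land closest to the target difficulty.
        draws = [self.rng.betavariate(a, b) for a, b in zip(self.alpha, self.beta)]
        ranked = sorted(range(len(draws)), key=lambda i: abs(draws[i] - self.target))
        return ranked[:k]

    def update(self, selected, successes, trials):
        # Explicit evidence: Bernoulli counts from rollouts on selected tasks.
        for i, s, n in zip(selected, successes, trials):
            self.alpha[i] += s
            self.beta[i] += n - s
        # Implicit evidence: nudge unevaluated tasks toward the observed batch
        # rate with a small pseudo-count, so their posteriors track model drift.
        if sum(trials):
            batch_rate = sum(successes) / sum(trials)
            chosen = set(selected)
            for i in range(len(self.alpha)):
                if i not in chosen:
                    self.alpha[i] += self.w * batch_rate
                    self.beta[i] += self.w * (1.0 - batch_rate)
```

In use, training alternates `select` (choose a batch of tasks) with `update` (fold rollout outcomes back into the posteriors); because the implicit step adds only scalar arithmetic per task, it costs no extra rollouts.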