UFO-RL: Uncertainty-Focused Optimization for Efficient Reinforcement Learning Data Selection

📅 2025-05-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the low data-selection efficiency and high computational cost of reinforcement learning fine-tuning for large language models (LLMs), this paper proposes a single-forward-pass uncertainty estimation framework grounded in Vygotsky's Zone of Proximal Development (ZPD) theory. It introduces the first ZPD-inspired approach to RL data filtering, adaptively defining ZPD boundaries via learnable uncertainty modeling. By replacing multi-sample evaluation with a single forward pass, the method achieves a 185× speedup in data evaluation. Combined with policy-optimization-guided data reweighting and lightweight confidence calibration, it significantly improves selection accuracy. Experiments show that training on only 10% of the samples selected this way matches full-data performance, delivers up to 16× end-to-end training acceleration, and markedly improves training stability and cross-task generalization.

📝 Abstract
Scaling RL for LLMs is computationally expensive, largely due to multi-sampling for policy optimization and evaluation, making efficient data selection crucial. Inspired by the Zone of Proximal Development (ZPD) theory, we hypothesize LLMs learn best from data within their potential comprehension zone. Addressing the limitation of conventional, computationally intensive multi-sampling methods for data assessment, we introduce UFO-RL. This novel framework uses a computationally efficient single-pass uncertainty estimation to identify informative data instances, achieving up to 185x faster data evaluation. UFO-RL leverages this metric to select data within the estimated ZPD for training. Experiments show that training with just 10% of data selected by UFO-RL yields performance comparable to or surpassing full-data training, reducing overall training time by up to 16x while enhancing stability and generalization. UFO-RL offers a practical and highly efficient strategy for scaling RL fine-tuning of LLMs by focusing learning on valuable data.
Problem

Research questions and friction points this paper is trying to address.

Efficient data selection for RL in LLMs to reduce computational costs
Identify optimal training data using single-pass uncertainty estimation
Improve RL fine-tuning speed and performance with minimal data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Single-pass uncertainty estimation for data selection
Focuses on data within Zone of Proximal Development
Achieves 185x faster data evaluation speed
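To make the selection idea concrete, here is a minimal numpy sketch of the general approach the bullets describe: score each candidate sample with a single-forward-pass uncertainty estimate (mean per-token predictive entropy is used here as an assumed proxy; the paper's learnable uncertainty model is not reproduced), then keep only samples whose uncertainty falls inside a mid band standing in for the ZPD. The band thresholds `low`/`high` and the function names are hypothetical illustration, not the paper's API.

```python
import numpy as np

def sequence_uncertainty(logits: np.ndarray) -> float:
    """Mean per-token predictive entropy from one forward pass.

    logits: array of shape (seq_len, vocab_size) for one sample.
    Returns entropy normalized to [0, 1] by log(vocab_size).
    """
    # numerically stable softmax over the vocabulary axis
    z = logits - logits.max(axis=-1, keepdims=True)
    p = np.exp(z) / np.exp(z).sum(axis=-1, keepdims=True)
    # per-token entropy, then average over the sequence
    ent = -(p * np.log(p + 1e-12)).sum(axis=-1)
    return float(ent.mean() / np.log(logits.shape[-1]))

def select_zpd(samples, low=0.3, high=0.7):
    """Keep samples whose uncertainty lies inside the assumed ZPD band:
    not trivially easy (near-zero entropy) and not hopelessly hard
    (near-uniform entropy). Thresholds are illustrative."""
    return [s for s in samples if low <= sequence_uncertainty(s) <= high]
```

Because each sample is scored with one forward pass instead of many sampled rollouts, the cost of data assessment drops by roughly the sampling factor, which is the source of the reported evaluation speedup.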
👥 Authors
Yang Zhao, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China
Kai Xiong, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China
Xiao Ding, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China
Li Du, Beijing Academy of Artificial Intelligence, Beijing, China
Yangou Ouyang, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China
Zhouhao Sun, Harbin Institute of Technology
Jiannan Guan, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China
Wenbin Zhang, Du Xiaoman Technology (Beijing) Co., Ltd.
Bin Liu, Du Xiaoman Technology (Beijing) Co., Ltd.
Dong Hu, Du Xiaoman Technology (Beijing) Co., Ltd.
Bing Qin, Harbin Institute of Technology
Ting Liu, Research Center for Social Computing and Information Retrieval, Harbin Institute of Technology, China