One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the data inefficiency of conventional reinforcement learning in large language models, which typically relies on vast quantities of high-quality samples. The authors propose Polymath Learning, a novel framework that leverages a single, carefully engineered interdisciplinary synthetic sample to elicit multifaceted reasoning capabilities in the model. This approach provides the first empirical validation of one-shot reinforcement learning in large language models. Central to the framework is a new paradigm termed “sample engineering,” which prioritizes the quality and structural design of individual samples over sheer volume. Experimental results demonstrate that Polymath Learning significantly outperforms existing methods—despite their reliance on substantially larger datasets—across multiple reasoning benchmarks in physics, chemistry, biology, and other domains, achieving highly effective reinforcement learning under extremely low-data conditions.

📝 Abstract
The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). Existing successful RL attempts in LLMs usually rely on thousands of high-quality samples or more. In this paper, we challenge fundamental assumptions about data requirements in RL for LLMs by demonstrating the remarkable effectiveness of one-shot learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary impact. We present three key findings: (1) a single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology, with RL; (2) the math skills salient to reasoning suggest the characteristics of the optimal polymath sample; and (3) an engineered synthetic sample that integrates multidisciplinary elements outperforms training with individual, naturally occurring samples. Our approach achieves superior performance to training with larger datasets across various reasoning benchmarks, demonstrating that sample quality and design, rather than quantity, may be the key to unlocking enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed sample engineering, toward precision engineering of training samples rather than simply increasing data volume.
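As a conceptual illustration only (not the paper's implementation), one-shot RL with a verifiable reward can be sketched as a policy-gradient loop that repeatedly rolls out on a single training sample and rewards answers a rule-based checker accepts. The action set, sample, learning rate, and baseline scheme below are all hypothetical toy choices:

```python
import math
import random

# Toy sketch of one-shot RL with a verifiable reward (hypothetical setup,
# not the paper's method): a softmax policy over candidate answers is
# trained with REINFORCE using rollouts on ONE training sample only.

random.seed(0)

ACTIONS = ["42", "7", "13"]                      # candidate answers (toy)
SAMPLE = {"question": "6*7=?", "answer": "42"}   # the single training sample

logits = [0.0, 0.0, 0.0]                         # policy parameters
LR = 0.5
baseline = 0.0                                   # running mean of rewards

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

for step in range(200):
    probs = softmax(logits)
    a = sample_index(probs)
    # Rule-based verifier: reward 1 iff the sampled answer is correct.
    reward = 1.0 if ACTIONS[a] == SAMPLE["answer"] else 0.0
    advantage = reward - baseline
    baseline += 0.1 * (reward - baseline)        # update running baseline
    # REINFORCE: grad of log pi(a) w.r.t. logits is one_hot(a) - probs
    for i in range(len(logits)):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += LR * advantage * grad

final_probs = softmax(logits)
best = ACTIONS[final_probs.index(max(final_probs))]
```

After training, the policy concentrates on the verifiable answer of the single sample; the paper's contribution is in how that one sample is engineered so the gains transfer across disciplines, which this toy does not capture.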
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
large language models
data efficiency
one-shot learning
reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

polymath learning
one-shot reinforcement learning
sample engineering
data efficiency
multidisciplinary reasoning
👤 Authors
Yiyuan Li
University of North Carolina at Chapel Hill (Natural Language Processing, Computational Linguistics)
Zhen Huang
Taobao & Tmall Group of Alibaba, GAIR
Yanan Wu
Alibaba Group (AI)
Weixun Wang
Alibaba
Xuefeng Li
Taobao & Tmall Group of Alibaba, GAIR
Yijia Luo
Taobao & Tmall Group of Alibaba
Wenbo Su
Taobao & Tmall Group of Alibaba
Bo Zheng
Researcher, Alibaba Group (AI, Network, E-Commerce)
Pengfei Liu
Taobao & Tmall Group of Alibaba, GAIR