One Sample to Rule Them All: Extreme Data Efficiency in RL Scaling

📅 2026-01-06
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the data inefficiency of conventional reinforcement learning in large language models, which typically relies on vast quantities of high-quality samples. The authors propose Polymath Learning, a novel framework that leverages a single, carefully engineered interdisciplinary synthetic sample to elicit multifaceted reasoning capabilities in the model. This approach provides the first empirical validation of one-shot reinforcement learning in large language models. Central to the framework is a new paradigm termed “sample engineering,” which prioritizes the quality and structural design of individual samples over sheer volume. Experimental results demonstrate that Polymath Learning significantly outperforms existing methods—despite their reliance on substantially larger datasets—across multiple reasoning benchmarks in physics, chemistry, biology, and other domains, achieving highly effective reinforcement learning under extremely low-data conditions.

📝 Abstract
The reasoning ability of large language models (LLMs) can be unleashed with reinforcement learning (RL) (OpenAI, 2024; DeepSeek-AI et al., 2025a; Zeng et al., 2025). Existing successful RL attempts in LLMs usually rely on thousands of high-quality samples or more. In this paper, we challenge fundamental assumptions about data requirements in RL for LLMs by demonstrating the remarkable effectiveness of one-shot learning. Specifically, we introduce polymath learning, a framework for designing one training sample that elicits multidisciplinary impact. We present three key findings: (1) a single, strategically selected math reasoning sample can produce significant performance improvements across multiple domains, including physics, chemistry, and biology, with RL; (2) the math skills salient to reasoning suggest the characteristics of the optimal polymath sample; and (3) an engineered synthetic sample that integrates multidisciplinary elements outperforms training with individual, naturally occurring samples. Our approach achieves superior performance to training with larger datasets across various reasoning benchmarks, demonstrating that sample quality and design, rather than quantity, may be the key to unlocking enhanced reasoning capabilities in language models. Our results suggest a shift, dubbed sample engineering, toward precision engineering of training samples rather than simply increasing data volume.
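As a conceptual illustration only (not the paper's implementation), one-shot RL with a verifiable reward can be sketched as a policy-gradient loop that repeatedly rolls out on a single training sample and rewards answers a rule-based checker accepts. The action set, sample, learning rate, and baseline scheme below are all hypothetical toy choices:

```python
import math
import random

# Toy sketch of one-shot RL with a verifiable reward (hypothetical setup,
# not the paper's method): a softmax policy over candidate answers is
# trained with REINFORCE using rollouts on ONE training sample only.

random.seed(0)

ACTIONS = ["42", "7", "13"]                      # candidate answers (toy)
SAMPLE = {"question": "6*7=?", "answer": "42"}   # the single training sample

logits = [0.0, 0.0, 0.0]                         # policy parameters
LR = 0.5
baseline = 0.0                                   # running mean of rewards

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sample_index(probs):
    r, acc = random.random(), 0.0
    for i, p in enumerate(probs):
        acc += p
        if r <= acc:
            return i
    return len(probs) - 1

for step in range(200):
    probs = softmax(logits)
    a = sample_index(probs)
    # Rule-based verifier: reward 1 iff the sampled answer is correct.
    reward = 1.0 if ACTIONS[a] == SAMPLE["answer"] else 0.0
    advantage = reward - baseline
    baseline += 0.1 * (reward - baseline)        # update running baseline
    # REINFORCE: grad of log pi(a) w.r.t. logits is one_hot(a) - probs
    for i in range(len(logits)):
        grad = (1.0 if i == a else 0.0) - probs[i]
        logits[i] += LR * advantage * grad

final_probs = softmax(logits)
best = ACTIONS[final_probs.index(max(final_probs))]
```

After training, the policy concentrates on the verifiable answer of the single sample; the paper's contribution is in how that one sample is engineered so the gains transfer across disciplines, which this toy does not capture.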
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
large language models
data efficiency
one-shot learning
reasoning
Innovation

Methods, ideas, or system contributions that make the work stand out.

polymath learning
one-shot reinforcement learning
sample engineering
data efficiency
multidisciplinary reasoning
👤 Authors
Yiyuan Li
University of North Carolina at Chapel Hill (Natural Language Processing, Computational Linguistics)
Zhen Huang
Taobao & Tmall Group of Alibaba, GAIR
Yanan Wu
Alibaba Group (AI)
Weixun Wang
Alibaba
Xuefeng Li
Taobao & Tmall Group of Alibaba, GAIR
Yijia Luo
Taobao & Tmall Group of Alibaba
Wenbo Su
Taobao & Tmall Group of Alibaba
Bo Zheng
Researcher, Alibaba Group (AI, Network, E-Commerce)
Pengfei Liu
Taobao & Tmall Group of Alibaba, GAIR