TTRL: Test-Time Reinforcement Learning

📅 2025-04-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) for large language models (LLMs) on reasoning tasks is hindered by the absence of ground-truth labels on unlabeled test data, making reward estimation unreliable. Method: We propose Test-Time Reinforcement Learning (TTRL), a novel paradigm that leverages pretrained model priors to construct implicit reward signals via test-time majority voting (Maj@N) over sampled rollouts, combined with test-time scaling (TTS) and online policy optimization, eliminating reliance on human-annotated rewards. Contribution/Results: This is an unsupervised framework enabling autonomous self-evolution during inference. On the AIME 2024 benchmark, Qwen-2.5-Math-7B achieves a 159% improvement in pass@1, approaching the performance upper bound of training directly on labeled test data. Our approach significantly extends the applicability of RL to open-domain reasoning without supervision.
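The core trick above, estimating rewards by majority voting over sampled answers, can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the function name and the example answers are hypothetical, and the reward is simply 1 for agreement with the majority answer (the Maj@N pseudo-label) and 0 otherwise.

```python
from collections import Counter

def majority_vote_rewards(answers):
    """Estimate per-rollout rewards without ground-truth labels.

    The most frequent final answer among N samples serves as a
    pseudo-label (Maj@N); each rollout gets reward 1 if its answer
    matches the majority answer, else 0.
    """
    majority, _ = Counter(answers).most_common(1)[0]
    rewards = [1 if a == majority else 0 for a in answers]
    return majority, rewards

# Hypothetical example: 8 sampled final answers to one math question.
label, rewards = majority_vote_rewards(
    ["42", "41", "42", "42", "7", "42", "41", "42"]
)
# "42" wins the vote, so rollouts answering "42" are rewarded.
```

These binary rewards then drive an ordinary RL update, so no human-annotated labels enter the loop.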

📝 Abstract
This paper investigates Reinforcement Learning (RL) on data without explicit labels for reasoning tasks in Large Language Models (LLMs). The core challenge of the problem is reward estimation during inference without access to ground-truth information. While this setting appears elusive, we find that common practices in Test-Time Scaling (TTS), such as majority voting, yield surprisingly effective rewards suitable for driving RL training. In this work, we introduce Test-Time Reinforcement Learning (TTRL), a novel method for training LLMs using RL on unlabeled data. TTRL enables self-evolution of LLMs by utilizing the priors in the pre-trained models. Our experiments demonstrate that TTRL consistently improves performance across a variety of tasks and models. Notably, TTRL boosts the pass@1 performance of Qwen-2.5-Math-7B by approximately 159% on AIME 2024 with only unlabeled test data. Furthermore, although TTRL is supervised only by the Maj@N metric, it consistently surpasses the intuitive upper bound of the initial model and approaches the performance of models trained directly on test data with ground-truth labels. Our experimental findings validate the general effectiveness of TTRL across various tasks and highlight its potential for broader tasks and domains. GitHub: https://github.com/PRIME-RL/TTRL
Problem

Research questions and friction points this paper is trying to address.

Reinforcement Learning on unlabeled data for LLMs
Reward estimation without ground-truth information
Self-evolution of LLMs using pre-trained priors
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses majority voting for reward estimation
Leverages pre-trained model priors for self-evolution
Applies RL on unlabeled test data
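To connect the last bullet to practice: once each rollout has a majority-vote reward, a policy-gradient update needs per-rollout advantages. A common choice in this line of work (an assumption here, not a detail confirmed by this excerpt) is to normalize rewards within the group of N rollouts for the same question, as in GRPO-style training:

```python
def group_normalized_advantages(rewards, eps=1e-6):
    """Turn binary majority-vote rewards into advantages.

    Centers and scales rewards within the group of N rollouts for one
    question, so rollouts agreeing with the majority get positive
    advantage and the rest get negative advantage. A sketch of one
    plausible design, not the paper's exact recipe.
    """
    n = len(rewards)
    mean = sum(rewards) / n
    var = sum((r - mean) ** 2 for r in rewards) / n
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]

# Hypothetical group of 4 rollouts: three matched the majority answer.
adv = group_normalized_advantages([1, 0, 1, 1])
# Majority-matching rollouts receive positive advantage, the outlier negative.
```

Because the advantages are mean-centered per group, the update pushes probability mass toward majority-consistent answers without needing any external label.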