🤖 AI Summary
This work proposes a verification-driven self-evolution framework to address reward bias and mode collapse in test-time reinforcement learning, failure modes often caused by frequent erroneous consensus among reasoning trajectories. The method introduces external tools (such as code execution) into the test-time reward mechanism for the first time, enabling direct validation of reasoning correctness. It further designs a verification-aware majority voting strategy to generate high-quality pseudo-labels for online self-training. By integrating verification signals into data synthesis, the framework significantly enhances the performance of large reasoning models on challenging mathematical benchmarks, including MATH-500, AMC, and AIME 2024. Notably, it outperforms existing test-time reinforcement learning baselines on difficult problems and effectively stabilizes the self-evolution process.
📝 Abstract
Test-time reinforcement learning (TTRL) has emerged as a promising paradigm for self-evolving large reasoning models (LRMs), enabling online adaptation on unlabeled test inputs via self-induced rewards derived from majority voting. However, a spurious but high-frequency unverified consensus can become a biased reward signal that is then reinforced, leading to mode collapse onto incorrect answers. We address this failure mode with T^3RL (Tool-Verification for Test-Time Reinforcement Learning), which introduces test-time tool verification into reward estimation. Concretely, a verifier uses external tool outputs as evidence (e.g., from code execution) to upweight verified rollouts in verification-aware voting, producing more reliable pseudo-labels for training. Across math benchmarks of varying difficulty (MATH-500, AMC, and AIME 2024) and diverse backbone types, T^3RL significantly improves over TTRL, with larger gains on harder problems. More broadly, T^3RL can be viewed as verified online data synthesis, highlighting test-time tool verification as a key mechanism for stabilizing self-evolution.
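The verification-aware voting described above can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the rollout representation, the binary `verified` flag (standing in for a tool check such as code execution), and the specific weights are all assumptions.

```python
from collections import defaultdict

def verification_aware_vote(rollouts, verified_weight=2.0, base_weight=1.0):
    """Select a pseudo-label by weighted majority voting over rollouts.

    rollouts: list of (answer, verified) pairs, where `verified` marks
    answers that passed an external tool check (e.g., code execution).
    The weights here are illustrative, not the paper's values.
    """
    scores = defaultdict(float)
    for answer, verified in rollouts:
        # Verified rollouts contribute more weight, so tool evidence can
        # override a spurious unverified consensus.
        scores[answer] += verified_weight if verified else base_weight
    pseudo_label = max(scores, key=scores.get)
    return pseudo_label, dict(scores)

# A spurious consensus on "42" (3 unverified votes) is outvoted by "7",
# which has fewer rollouts but one tool-verified trajectory.
rollouts = [("42", False), ("42", False), ("42", False),
            ("7", True), ("7", False), ("7", False)]
label, scores = verification_aware_vote(rollouts)
```

Under plain majority voting the two answers would tie (or the spurious answer could win outright); the verification weight breaks the tie toward the tool-checked trajectory, which is then used as the pseudo-label rewarding rollouts during self-training.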