TEMPO: Scaling Test-time Training for Large Reasoning Models

📅 2026-04-21

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing test-time training methods saturate quickly on large reasoning models: they fail to exploit additional test-time compute, so performance plateaus and output diversity shrinks. This work proposes TEMPO, a framework that formalizes test-time training as an Expectation-Maximization (EM) procedure. TEMPO alternates between refining a policy model on unlabeled test questions and periodically recalibrating a critic model on labeled data, restoring the critic update that prior approaches omit. On the AIME 2024 benchmark, accuracy rises from 33.0% to 51.1% for OLMO3-7B and from 42.3% to 65.8% for Qwen3-14B, while output diversity is preserved.
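The alternation described in the summary can be sketched as a simple schedule. This is a minimal illustration, not the paper's implementation: the function name `alternating_ttt`, the callbacks `policy_step` and `critic_step`, and the fixed interval `recalibrate_every` are all hypothetical, and the summary does not say how often the critic is actually recalibrated.

```python
from typing import Callable, List, Sequence, Tuple


def alternating_ttt(
    policy_step: Callable[[Sequence[str]], None],
    critic_step: Callable[[Sequence[Tuple[str, str]]], None],
    unlabeled: Sequence[str],
    labeled: Sequence[Tuple[str, str]],
    total_steps: int,
    recalibrate_every: int,
) -> List[str]:
    """Interleave policy updates with periodic critic recalibration.

    All names and the fixed-interval schedule are assumptions made for
    illustration; they are not taken from the paper.
    """
    if total_steps < 0 or recalibrate_every < 1:
        raise ValueError("total_steps must be >= 0 and recalibrate_every >= 1")

    schedule: List[str] = []
    for step in range(1, total_steps + 1):
        # Refine the policy on unlabeled test questions, scored by the critic.
        policy_step(unlabeled)
        schedule.append("policy")

        # Every `recalibrate_every` steps, re-fit the critic on labeled data
        # so its reward signal tracks the updated policy.
        if step % recalibrate_every == 0:
            critic_step(labeled)
            schedule.append("critic")
    return schedule
```

Setting `recalibrate_every` larger than `total_steps` yields a schedule with no critic updates, which corresponds to the incomplete variant the summary attributes to prior methods.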

📝 Abstract

Test-time training (TTT) adapts model parameters on unlabeled test instances at inference time, which continuously extends capabilities beyond the reach of offline training. Despite initial gains, existing TTT methods for large reasoning models (LRMs) plateau quickly and do not benefit from additional test-time compute. Without external calibration, the self-generated reward signal increasingly drifts as the policy model evolves, leading to both performance plateaus and diversity collapse. We propose TEMPO, a TTT framework that interleaves policy refinement on unlabeled questions with periodic critic recalibration on a labeled dataset. By formalizing this alternating procedure through the Expectation-Maximization (EM) algorithm, we reveal that prior methods can be interpreted as incomplete variants that omit the crucial recalibration step. Reintroducing this step tightens the evidence lower bound (ELBO) and enables sustained improvement. Across diverse model families (Qwen3 and OLMO3) and reasoning tasks, TEMPO improves OLMO3-7B on AIME 2024 from 33.0% to 51.1% and Qwen3-14B from 42.3% to 65.8%, while maintaining high diversity.
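For readers unfamiliar with the ELBO argument, the textbook EM decomposition is given below. The symbols are generic and assumed here for illustration (observed data $x$, latent variable $z$, parameters $\theta$, variational distribution $q$). The abstract does not give the paper's own notation, nor does it say how the policy and critic map onto these quantities.

```latex
\begin{align}
\log p_\theta(x)
  &= \underbrace{\mathbb{E}_{q(z)}\!\left[\log \frac{p_\theta(x, z)}{q(z)}\right]}_{\mathcal{L}(q,\,\theta)\ \text{(ELBO)}}
   + \mathrm{KL}\!\left(q(z)\,\|\,p_\theta(z \mid x)\right) \\
\text{E-step:}\quad q^{(t+1)}
  &= \arg\max_{q}\ \mathcal{L}\!\left(q, \theta^{(t)}\right)
   = p_{\theta^{(t)}}(z \mid x) \\
\text{M-step:}\quad \theta^{(t+1)}
  &= \arg\max_{\theta}\ \mathcal{L}\!\left(q^{(t+1)}, \theta\right)
\end{align}
```

Because the KL term is non-negative, the ELBO never exceeds $\log p_\theta(x)$, and the bound is tight exactly when $q(z) = p_\theta(z \mid x)$. A procedure that keeps running one step while skipping the other leaves this gap open, which is the generic sense in which an omitted step yields a looser bound.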

Problem

Research questions and friction points this paper is trying to address.

Test-time Training

Large Reasoning Models

Reward Drift

Performance Plateau

Diversity Collapse

Innovation

Methods, ideas, or system contributions that make the work stand out.

Test-time Training

Expectation-Maximization

Critic Recalibration