Test-Time Scaling with Reflective Generative Model

📅 2025-07-02
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses the parameter redundancy and inference inefficiency that arise when the policy model (PM) and the process reward model (PRM) are separate networks. It proposes the Self-supervised Process Reward Model (SPRM), which unifies the two in a single architecture. Its key elements are: (1) a shared backbone network; (2) task-specific output heads for next-token prediction and process scoring; and (3) a self-supervised training scheme for process rewards that needs no extra process annotation. Built on SPRM, MetaStone-S1 is the authors' first reflective generative model. It supports test-time scaling through three reasoning-effort modes (low, medium, high) based on controllable thinking length. SPRM cuts PRM parameters by over 99% relative to a separate PRM, and the authors empirically establish a scaling law relating total thinking computation to test-time scaling performance. At the 32B parameter scale, MetaStone-S1 performs comparably to the OpenAI o3-mini series. Both code and models are publicly released.
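
The shared-backbone, two-head layout described above can be illustrated with a toy model. This is a minimal sketch under stated assumptions, not the MetaStone-S1 implementation: the class `SharedBackboneModel`, the sizes `HIDDEN` and `VOCAB`, the single tanh layer standing in for a transformer, and the sigmoid on the scoring head are all hypothetical choices made for illustration.

```python
import math
import random

HIDDEN = 8   # hypothetical hidden size, chosen only to keep the toy small
VOCAB = 16   # hypothetical vocabulary size


def make_matrix(rng, rows, cols):
    """Small random weight matrix stored as a list of rows."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(weights, bias, x):
    """Matrix-vector product plus bias."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]


def count(weights, bias):
    """Number of scalar parameters in one linear layer."""
    return sum(len(row) for row in weights) + len(bias)


class SharedBackboneModel:
    """Toy stand-in: one backbone feeding a policy head and a scoring head."""

    def __init__(self, seed=0):
        rng = random.Random(seed)
        self.embed = make_matrix(rng, VOCAB, HIDDEN)
        # Shared backbone (assumption: one tanh layer instead of a transformer).
        self.backbone = (make_matrix(rng, HIDDEN, HIDDEN), [0.0] * HIDDEN)
        # Head 1: next-token prediction over the vocabulary.
        self.policy_head = (make_matrix(rng, VOCAB, HIDDEN), [0.0] * VOCAB)
        # Head 2: process scoring, a single scalar per reasoning prefix.
        self.score_head = (make_matrix(rng, 1, HIDDEN), [0.0])

    def hidden(self, token_ids):
        # Mean-pool the token embeddings, then apply the shared layer.
        pooled = [sum(self.embed[t][i] for t in token_ids) / len(token_ids)
                  for i in range(HIDDEN)]
        w, b = self.backbone
        return [math.tanh(v) for v in linear(w, b, pooled)]

    def next_token_logits(self, token_ids):
        w, b = self.policy_head
        return linear(w, b, self.hidden(token_ids))

    def process_score(self, token_ids):
        # Sigmoid squashes the scalar into (0, 1) (assumed score range).
        w, b = self.score_head
        z = linear(w, b, self.hidden(token_ids))[0]
        return 1.0 / (1.0 + math.exp(-z))

    def param_counts(self):
        shared = sum(len(row) for row in self.embed) + count(*self.backbone)
        return {"shared": shared,
                "policy_head": count(*self.policy_head),
                "score_head": count(*self.score_head)}


model = SharedBackboneModel()
prefix = [1, 4, 2]                      # hypothetical token ids
logits = model.next_token_logits(prefix)
score = model.process_score(prefix)
counts = model.param_counts()
```

In this toy, scoring a reasoning prefix reuses the same hidden state as generation, so the only scorer-specific parameters are those in `score_head`. A separate PRM would need its own copy of the backbone. The toy's parameter ratio is not meant to reproduce the 99% figure reported above, which depends on the real model sizes.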

📝 Abstract
We introduce our first reflective generative model, MetaStone-S1, which attains OpenAI o3-mini-level performance via the self-supervised process reward model (SPRM). By sharing the backbone network and using task-specific heads for next-token prediction and process scoring, SPRM integrates the policy model and the process reward model (PRM) into a unified interface without extra process annotation, reducing PRM parameters by over 99% for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suited to test-time scaling (TTS), and we provide three reasoning-effort modes (low, medium, and high) based on controllable thinking length. Moreover, we empirically establish a scaling law that relates total thinking computation to TTS performance. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series with only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
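
One common way to turn process scores into test-time scaling is to sample several candidate reasoning trajectories and keep the one the scorer rates highest. The sketch below shows that selection step only. The candidate counts per effort mode, the geometric-mean aggregation, and the names `EFFORT_TO_CANDIDATES`, `trajectory_score`, `select_best`, and `answer_with_effort` are assumptions for illustration; the abstract does not specify them.

```python
import math

# Hypothetical mapping from reasoning-effort mode to number of sampled candidates.
EFFORT_TO_CANDIDATES = {"low": 2, "medium": 8, "high": 32}


def trajectory_score(step_scores):
    """Aggregate per-step process scores into one trajectory score.

    Assumption: geometric mean of scores in (0, 1], so that a single
    weak step pulls the whole trajectory down.
    """
    if not step_scores:
        raise ValueError("a trajectory needs at least one scored step")
    # Clamp to avoid log(0) if a step score is exactly zero.
    logs = [math.log(max(s, 1e-12)) for s in step_scores]
    return math.exp(sum(logs) / len(logs))


def select_best(candidates):
    """candidates: list of (answer, step_scores) pairs; returns the best pair."""
    return max(candidates, key=lambda c: trajectory_score(c[1]))


def answer_with_effort(effort, sample_fn):
    """Sample k candidates for the chosen effort mode and return the best answer.

    sample_fn(i) is a hypothetical callable returning (answer, step_scores)
    for the i-th sampled trajectory.
    """
    k = EFFORT_TO_CANDIDATES[effort]
    candidates = [sample_fn(i) for i in range(k)]
    return select_best(candidates)[0]


# Usage with hand-written candidates: "B" has no weak step, so it is selected.
best_answer, best_steps = select_best([("A", [0.9, 0.2]), ("B", [0.7, 0.7])])
```

Higher effort modes sample more candidates, so they spend more total thinking computation in exchange for a better chance that at least one high-scoring trajectory is available to select.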
Problem

Research questions and friction points this paper is trying to address.

Develops a self-supervised reflective generative model for efficient reasoning
Integrates policy and reward models without extra annotation
Establishes scaling law for test-time performance and computation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised process reward model reduces parameters
Test time scaling with controllable thinking length
Unified interface integrates policy and reward models