Test-Time Scaling with Reflective Generative Model

📅 2025-07-02
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the parameter redundancy and inference inefficiency arising from the architectural separation of policy models (PMs) and process reward models (PRMs). It proposes the Self-supervised Process Reward Model (SPRM), which unifies the PM and PRM into a single architecture. Key innovations include: (1) a shared backbone network; (2) task-specific output heads for policy generation (next-token prediction) and process-level reward estimation; and (3) a self-supervised learning paradigm for process rewards that requires no human annotation. Built on SPRM, MetaStone-S1 is the first reflective generative model supporting test-time scaling (TTS) with controllable reasoning depth (low/medium/high modes). It reduces PRM parameters by over 99% relative to conventional dual-model approaches and empirically establishes a scaling law linking computational cost (i.e., total thinking computation) to TTS performance. At the 32B parameter scale, it matches the performance of the OpenAI o3-mini series. Both code and models are publicly released.
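The shared-backbone design described above can be sketched as a toy model: one backbone transform feeding two task-specific heads, a policy head producing next-token probabilities and a reward head producing a per-step process score. Everything below (dimensions, the single dense layer standing in for a transformer stack, parameter names) is a hypothetical illustration, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions for illustration only.
D_MODEL, VOCAB = 16, 32

# Shared backbone parameters (a single dense layer standing in for a transformer stack).
W_backbone = rng.standard_normal((D_MODEL, D_MODEL)) * 0.1
# Task-specific heads: policy (next-token logits) and process reward (scalar score).
W_policy = rng.standard_normal((D_MODEL, VOCAB)) * 0.1
w_reward = rng.standard_normal(D_MODEL) * 0.1

def backbone(h):
    """Shared representation consumed by both heads."""
    return np.tanh(h @ W_backbone)

def policy_head(h):
    """Next-token distribution over the vocabulary (softmax of logits)."""
    logits = backbone(h) @ W_policy
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def reward_head(h):
    """Per-step process score in (0, 1) via a sigmoid."""
    return 1.0 / (1.0 + np.exp(-(backbone(h) @ w_reward)))

# One hidden state per reasoning step of a candidate trajectory.
steps = rng.standard_normal((5, D_MODEL))
probs = policy_head(steps)    # shape (5, VOCAB); each row sums to 1
scores = reward_head(steps)   # shape (5,); each entry in (0, 1)
```

The point of the shared backbone is that the reward head adds only a small number of extra parameters on top of the policy model, which is how a unified model can shed the bulk of a standalone PRM.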

📝 Abstract
We introduce our first reflective generative model, MetaStone-S1, which obtains OpenAI o3-level performance via the self-supervised process reward model (SPRM). By sharing the backbone network and using task-specific heads for next-token prediction and process scoring respectively, SPRM integrates the policy model and process reward model (PRM) into a unified interface without extra process annotation, reducing PRM parameters by over 99% for efficient reasoning. Equipped with SPRM, MetaStone-S1 is naturally suited to test-time scaling (TTS), and we provide three reasoning effort modes (low, medium, and high) based on controllable thinking length. Moreover, we empirically establish a scaling law that reveals the relationship between total thinking computation and TTS performance. Experiments demonstrate that MetaStone-S1 achieves performance comparable to the OpenAI o3-mini series at only 32B parameters. To support the research community, we have open-sourced MetaStone-S1 at https://github.com/MetaStone-AI/MetaStone-S1.
Problem

Research questions and friction points this paper is trying to address.

Develops a self-supervised reflective generative model for efficient reasoning
Integrates policy and reward models without extra annotation
Establishes scaling law for test-time performance and computation
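The scaling-law point above can be illustrated by fitting a log-linear form, performance = a + b·log(compute), which is a common shape for such laws. The data points below are synthetic and purely illustrative; they are not the paper's measurements, and the paper's actual functional form may differ.

```python
import numpy as np

# Synthetic illustrative points (NOT the paper's data): total thinking
# compute (e.g., reasoning tokens) vs. benchmark accuracy.
compute = np.array([1e3, 4e3, 1.6e4, 6.4e4, 2.56e5])
accuracy = np.array([0.52, 0.58, 0.64, 0.70, 0.76])

# Fit accuracy = a + b * log(compute) by ordinary least squares.
X = np.column_stack([np.ones_like(compute), np.log(compute)])
(a, b), *_ = np.linalg.lstsq(X, accuracy, rcond=None)

predicted = a + b * np.log(compute)
```

A positive slope b quantifies how much extra accuracy each multiplicative increase in thinking compute buys, which is what makes the trade-off between reasoning steps and performance predictable.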
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-supervised process reward model reduces parameters
Test-time scaling with controllable thinking length
Unified interface integrates policy and reward models
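One way the unified interface supports test-time scaling is best-of-N selection: the reward head scores each step of every candidate reasoning chain, and the chain with the best aggregate score is kept. The sketch below uses a geometric mean to aggregate step scores; that choice, and the candidate data, are illustrative assumptions, not the paper's exact rule.

```python
import math

def trajectory_score(step_scores):
    """Aggregate per-step process scores into one trajectory score.
    Geometric mean is one common choice; the paper's rule may differ."""
    logs = sum(math.log(s) for s in step_scores)
    return math.exp(logs / len(step_scores))

def select_best(candidates):
    """Pick the candidate trajectory with the highest aggregate score."""
    return max(candidates, key=lambda c: trajectory_score(c["step_scores"]))

# Three hypothetical candidate chains with SPRM-style step scores.
candidates = [
    {"answer": "A", "step_scores": [0.9, 0.4, 0.8]},
    {"answer": "B", "step_scores": [0.7, 0.7, 0.7]},
    {"answer": "C", "step_scores": [0.95, 0.2, 0.6]},
]
best = select_best(candidates)  # the consistently-scored chain "B" wins
```

Note that the geometric mean penalizes a single weak step more than an arithmetic mean would, which is why the uniformly solid chain beats chains with one strong and one weak step.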
Authors
Zixiao Wang (University of Science and Technology of China)
Yuxin Wang (MetaStone-AI & USTC)
Xiaorui Wang (Professor of Computer Engineering, The Ohio State University)
Mengting Xing (MetaStone-AI & USTC)
Jie Gao (MetaStone-AI & USTC)
Jianjun Xu (MetaStone-AI & USTC)
Guangcan Liu (MetaStone-AI & USTC)
Chenhui Jin (MetaStone-AI & USTC)
Zhuo Wang (MetaStone-AI & USTC)
Shengzhuo Zhang (MetaStone-AI & USTC)
Hongtao Xie (MetaStone-AI & USTC)