Surrogate Signals from Format and Length: Reinforcement Learning for Solving Mathematical Problems without Ground Truth Answers

📅 2025-05-26
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the scarcity, high cost, or outright unavailability of ground-truth annotations for mathematical reasoning tasks, this paper proposes a reinforcement learning framework that requires no labeled answers. The method uses the format correctness and length consistency of model outputs as surrogate reward signals within GRPO: a format-only reward drives early-phase gains, and a dynamic length-consistency reward compensates for the format reward's limitations in later phases. Crucially, experiments suggest that base models already possess latent reasoning capabilities; performance gains arise primarily from aligning outputs with answer-format conventions rather than from acquiring new reasoning knowledge. On the AIME2024 benchmark, a 7B base model trained this way reaches 40.0% accuracy, matching and in certain scenarios surpassing standard GRPO trained with ground-truth answers, demonstrating the effectiveness and practicality of label-free training for mathematical reasoning.
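The summary above describes two surrogate signals: a format reward and a length-consistency reward computed over a group of sampled responses. A minimal sketch of how such signals could be combined is below; the function names, the `\boxed{}` format check, the word-count length measure, and the `w_len` weighting are all illustrative assumptions, not the paper's exact definitions.

```python
import re
import statistics

def format_reward(response: str) -> float:
    """Assumed format check: 1.0 if the response ends its reasoning with a
    \\boxed{...} final answer, else 0.0. The paper's actual format criterion
    may differ."""
    return 1.0 if re.search(r"\\boxed\{[^{}]+\}", response) else 0.0

def length_consistency_reward(lengths: list[int], idx: int) -> float:
    """Reward responses whose length stays close to the group median.
    The linear penalty and scale here are guesses, not the paper's schedule."""
    median = statistics.median(lengths)
    if median == 0:
        return 0.0
    deviation = abs(lengths[idx] - median) / median
    return max(0.0, 1.0 - deviation)

def surrogate_reward(responses: list[str], idx: int, w_len: float = 0.5) -> float:
    """Combined label-free reward for the idx-th response in a sampled group."""
    lengths = [len(r.split()) for r in responses]
    return format_reward(responses[idx]) + w_len * length_consistency_reward(lengths, idx)
```

Note that neither signal ever inspects the ground-truth answer; both are computed purely from the sampled outputs themselves.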

📝 Abstract
Large Language Models have achieved remarkable success in natural language processing tasks, with Reinforcement Learning playing a key role in adapting them to specific applications. However, obtaining ground truth answers for training LLMs in mathematical problem-solving is often challenging, costly, and sometimes unfeasible. This research delves into the utilization of format and length as surrogate signals to train LLMs for mathematical problem-solving, bypassing the need for traditional ground truth answers. Our study shows that a reward function centered on format correctness alone can yield performance improvements comparable to the standard GRPO algorithm in early phases. Recognizing the limitations of format-only rewards in the later phases, we incorporate length-based rewards. The resulting GRPO approach, leveraging format-length surrogate signals, not only matches but surpasses the performance of the standard GRPO algorithm relying on ground truth answers in certain scenarios, achieving 40.0% accuracy on AIME2024 with a 7B base model. Through systematic exploration and experimentation, this research offers a practical solution for training LLMs to solve mathematical problems while reducing the dependence on extensive ground truth data collection. It also reveals why our label-free approach succeeds: the base model is like an excellent student who has already mastered mathematical and logical reasoning skills but performs poorly on exams; it simply needs to develop good answering habits to achieve outstanding results, in other words, to unlock the capabilities it already possesses.
Problem

Research questions and friction points this paper is trying to address.

Training LLMs for math without ground truth answers
Using format and length as surrogate training signals
Reducing dependence on costly labeled data collection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses format and length as surrogate signals
Combines format and length-based rewards
Reduces dependence on ground truth data
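Both the summary and abstract rely on GRPO, which scores each sampled response relative to its own sampling group rather than using a learned value network. A short sketch of that group-normalized advantage step is below, assuming the standard GRPO normalization (group mean and standard deviation); the rewards fed in would be the format/length surrogate rewards rather than answer-correctness rewards.

```python
from statistics import mean, pstdev

def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """GRPO-style advantage estimate: normalize each response's reward by the
    mean and standard deviation of its sampling group. No critic is needed;
    the group itself serves as the baseline."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Because the advantage depends only on relative reward within the group, a label-free surrogate reward can plug in wherever a ground-truth reward would: responses with better format and more typical length are pushed up, the rest are pushed down.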
Rihui Xin
Baichuan Inc.
Han Liu
Baichuan Inc., Tsinghua University
Zecheng Wang
Baichuan Inc., Harbin Institute of Technology
Yupeng Zhang
Baichuan Inc.
Dianbo Sui
Harbin Institute of Technology
Xiaolin Hu
Tsinghua University
Bingning Wang
Baichuan Inc.
NLP · Question Answering · Large language model