Bootstrapping Post-training Signals for Open-ended Tasks via Rubric-based Self-play on Pre-training Text

📅 2026-04-21
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
Existing self-play approaches are confined to verifiable tasks such as math and coding, while post-training for open-ended tasks has traditionally depended on high-quality input-output pairs written by humans or expensive proprietary models. This work proposes POP, a self-play framework that brings rubric-based evaluation to open-ended post-training. POP uses a single language model to synthesize input-output pairs together with an evaluation rubric for each example; the rubric then scores the model's outputs, and those scores serve as the reward for reinforcement learning. Grounding the synthesis in a content-rich pretraining corpus ensures a gap between generation and verification, which reduces reward hacking, and also helps prevent mode collapse. On Qwen-2.5-7B, POP improves both pretrained and instruction-tuned models on open-ended tasks including long-form healthcare question answering, creative writing, and instruction following.

📝 Abstract
Self-play has recently emerged as a promising paradigm to train Large Language Models (LLMs). In self-play, the target LLM creates the task input (e.g., ask a question), which it then addresses itself by producing a task output (e.g., give an answer). A reward model evaluates the output, and the rewards are then used to train the LLM, typically via Reinforcement Learning (RL). Self-play incurs minimal supervision costs, and this is especially helpful for post-training LLMs, which require high-quality input-output pairs that traditionally have to be written by humans or expensive proprietary models. However, existing work explores self-play only for verifiable tasks such as math and coding. Instead, we seek to extend it to more realistic open-ended tasks. In particular, we propose POP, a self-play framework that uses the same LLM to synthesize evaluation rubrics, along with input-output pairs, for each example. The rubric is then used to evaluate outputs and train the model. We further ground the framework on a content-rich pretraining corpus to (1) ensure a generation-verification gap and reduce reward hacking, and (2) prevent mode collapse. On Qwen-2.5-7B, POP increases performance of both pretrained and instruction-tuned models, across different tasks ranging from long-form Healthcare QA to creative writing and instruction following.
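
The loop described in the abstract (synthesize a task and a rubric from a pretraining passage, answer the task, score the answer against the rubric) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the names `Criterion`, `Example`, `rubric_reward`, and `self_play_step`, the per-criterion weights, and the weighted-fraction reward are all assumptions made here, and the toy `propose`, `respond`, and `judge` functions stand in for calls to the same LLM. The RL update itself is omitted.

```python
from dataclasses import dataclass
from itertools import cycle
from typing import Callable, List, Tuple


@dataclass
class Criterion:
    description: str  # one rubric item, e.g. "mentions dosage"
    weight: float     # relative importance (an assumption, not from the paper)


@dataclass
class Example:
    task_input: str          # task synthesized from a pretraining passage
    rubric: List[Criterion]  # rubric synthesized for that same task


def rubric_reward(output: str, rubric: List[Criterion],
                  judge: Callable[[str, str], bool]) -> float:
    """Weighted fraction of rubric criteria the judge marks as satisfied."""
    total = sum(c.weight for c in rubric)
    if total == 0:
        return 0.0  # an empty rubric carries no signal
    earned = sum(c.weight for c in rubric if judge(output, c.description))
    return earned / total


def self_play_step(passage: str,
                   propose: Callable[[str], Example],
                   respond: Callable[[str], str],
                   judge: Callable[[str, str], bool],
                   num_samples: int = 2) -> List[Tuple[str, float]]:
    """One round: propose a task and rubric, sample outputs, score each one."""
    example = propose(passage)
    outputs = [respond(example.task_input) for _ in range(num_samples)]
    return [(o, rubric_reward(o, example.rubric, judge)) for o in outputs]


# Toy stand-ins so the sketch runs without a model. In a real system all
# three roles would be played by the same LLM (hypothetical wiring).
def toy_propose(passage: str) -> Example:
    return Example(
        task_input=f"Explain the key points of: {passage}",
        rubric=[Criterion("dosage", 2.0), Criterion("side effects", 1.0)],
    )


_answers = cycle([
    "Typical dosage is 200 mg.",
    "Covers dosage and side effects.",
])


def toy_respond(task_input: str) -> str:
    return next(_answers)


def toy_judge(output: str, criterion: str) -> bool:
    # Keyword match as a crude proxy for an LLM rubric check.
    return criterion.lower() in output.lower()


scored = self_play_step("a passage about ibuprofen",
                        toy_propose, toy_respond, toy_judge)
```

In this sketch the resulting `(output, reward)` pairs would feed whatever RL algorithm is used for training; the passage argument is what ties each task and rubric to the pretraining corpus.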

Problem

Research questions and friction points this paper is trying to address.

self-play
open-ended tasks
Large Language Models
post-training
reward evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

self-play
rubric-based evaluation
open-ended tasks
reward hacking mitigation
pretraining corpus grounding