Native Reasoning Models: Training Language Models to Reason on Unverifiable Data

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work proposes NRT, a novel framework that enables end-to-end reasoning training for large language models without relying on expert demonstrations or external verifiers—addressing the high cost, potential bias, and inapplicability to unverifiable tasks inherent in existing approaches. By modeling reasoning paths as latent variables, NRT establishes a self-reinforcement learning loop through self-generated trajectories and an intrinsic reward mechanism, all optimized under a unified objective. The framework systematically mitigates policy collapse and introduces a more robust reward aggregation strategy. Experiments demonstrate that NRT significantly outperforms standard supervised fine-tuning and current verifier-free reinforcement learning methods on Llama and Mistral models, achieving state-of-the-art performance while effectively generalizing to complex and unverifiable reasoning tasks.
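The latent-variable framing described above can be written as one plausible formalization (a hedged sketch, not necessarily the paper's exact notation): the reasoning trace z is marginalized out of the answer likelihood, and each sampled trace is scored by how much it raises the probability of the ground-truth answer.

```latex
\max_{\theta} \; \mathbb{E}_{(x,y)\sim\mathcal{D}}
\Big[\log \textstyle\sum_{z} p_\theta(z \mid x)\, p_\theta(y \mid x, z)\Big],
\qquad
r(z) \;=\; \log p_\theta(y \mid x, z)
```

Here x is the question, y the ground-truth answer, and r(z) the intrinsic reward for a sampled trace; optimizing this objective rewards reasoning paths that resolve the model's own uncertainty about y.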

📝 Abstract
The prevailing paradigm for training large reasoning models, combining Supervised Fine-Tuning (SFT) with Reinforcement Learning with Verifiable Rewards (RLVR), is fundamentally constrained by its reliance on high-quality, human-annotated reasoning data and external verifiers. This dependency incurs significant data-collection costs, risks embedding human cognitive biases, and confines the reinforcement learning stage to objectively assessable domains like mathematics and coding, leaving a wide range of unverifiable tasks beyond its scope. To overcome these limitations, we introduce NRT (Native Reasoning Training), a novel framework that cultivates complex reasoning by having the model generate its own reasoning traces using only standard question-answer pairs, thereby obviating the need for expert-written demonstrations. NRT reframes the training problem by treating the reasoning process as a latent variable. It employs a unified training objective that models reasoning as an optimization problem, intrinsically rewarding paths that increase the model's likelihood of producing the ground-truth answer. This unified perspective allows us to analyze intrinsic failure modes of prior methods, such as policy collapse, and systematically design more robust reward aggregation functions, creating a self-reinforcing feedback loop where the model learns to think in ways that resolve its own uncertainty. Empirical evaluation on Llama and Mistral model families demonstrates that NRT achieves state-of-the-art performance among verifier-free methods, significantly outperforming standard SFT baselines and prior verifier-free RL methods. Our approach yields particularly strong performance gains in complex reasoning domains and exhibits high robustness to policy collapse, offering a general, scalable path toward building more powerful and broadly applicable reasoning systems.
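The intrinsic-reward loop described in the abstract can be sketched in a few lines. This is an illustrative toy, not the paper's implementation: `answer_logprob` is a hypothetical stand-in for the model's log p(answer | question, trace), and the mean-baseline aggregation is one simple, collapse-resistant choice; the paper's actual aggregation function may differ.

```python
def answer_logprob(question, trace, answer):
    """Hypothetical scorer standing in for log p(answer | question, trace).
    Toy rule: traces that mention the answer score higher."""
    return 0.0 if answer in trace else -5.0

def intrinsic_rewards(question, traces, answer):
    """Reward each self-generated reasoning trace by how much it helps
    the model produce the ground-truth answer."""
    return [answer_logprob(question, t, answer) for t in traces]

def aggregate_advantages(rewards):
    """Mean-baseline aggregation: subtracting the group mean keeps the
    update zero-sum across sampled traces, discouraging policy collapse
    onto a single trace."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]

# Two sampled traces for the same question-answer pair.
traces = ["think step by step... so the answer is 42", "irrelevant rambling"]
rewards = intrinsic_rewards("q", traces, "42")
advs = aggregate_advantages(rewards)  # trace 1 is reinforced, trace 2 suppressed
```

In a full system the advantages would weight a policy-gradient update on the trace tokens, closing the self-reinforcing loop using only question-answer pairs.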
Problem

Research questions and friction points this paper is trying to address.

reasoning models
unverifiable data
reinforcement learning
human-annotated data
verifier-free
Innovation

Methods, ideas, or system contributions that make the work stand out.

Native Reasoning Training
verifier-free reasoning
latent reasoning
policy collapse
intrinsic reward