SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses reward scarcity in open-domain reasoning, where ambiguous answers preclude reliable supervision signals for reinforcement learning. To overcome this limitation, the authors propose Structure-Aware Reinforcement Learning (SARL), which brings the small-world properties of complex networks into the reinforcement learning framework. SARL constructs a topological Reasoning Map of inference paths and uses its local clustering and globally efficient connectivity as an unsupervised reward signal, shifting the training paradigm from outcome-oriented to process-oriented. Implemented with PPO/GRPO on Qwen3-4B, SARL substantially outperforms existing baselines, achieving gains of 9.1%–11.6% on mathematical reasoning tasks and 30.4%–34.6% on open-domain tasks, while also exhibiting lower KL divergence and higher policy entropy.
📝 Abstract
Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce structure-aware reinforcement learning (SARL), a label-free framework that constructs a per-response Reasoning Map from intermediate thinking steps and rewards its small-world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show SARL surpasses ground-truth-based RL and prior label-free RL baselines, achieving the best average gains of 9.1% under PPO and 11.6% under GRPO on math tasks, and 34.6% under PPO and 30.4% under GRPO on open-ended tasks. Beyond strong performance, SARL also exhibits lower KL divergence and higher policy entropy, indicating more stable, exploratory training and generalized reasoning ability.
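The abstract's core idea, rewarding local coherence plus global efficiency of a per-response Reasoning Map, can be sketched as a small-world score over a step graph. The sketch below is an illustration, not the paper's implementation: how steps are extracted and linked into the graph (`adj`), and the weights `alpha`/`beta`, are hypothetical assumptions; the two graph statistics (average local clustering and average shortest-path length) are the standard small-world ingredients the abstract names.

```python
from collections import deque
from itertools import combinations

def avg_clustering(adj):
    """Average local clustering coefficient: how densely each
    step's neighbors link to one another (local coherence)."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # clustering undefined for < 2 neighbors; count as 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def avg_shortest_path(adj):
    """Mean shortest-path length via BFS from every node
    (global efficiency proxy); assumes a connected graph."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(d for v, d in dist.items() if v != src)
        pairs += len(dist) - 1
    return total / pairs

def small_world_reward(adj, alpha=1.0, beta=1.0):
    """Hypothetical label-free reward: reward high clustering
    (locally coherent steps) and short path length (globally
    efficient connectivity). alpha/beta are assumed weights."""
    return alpha * avg_clustering(adj) + beta / avg_shortest_path(adj)

# Toy Reasoning Map: 5 reasoning steps; undirected adjacency sets.
toy_map = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {1, 2, 4},
    4: {3},
}
reward = small_world_reward(toy_map)
```

In an actual RLVR-style loop this scalar would replace the verifiable-answer reward inside PPO/GRPO, so the policy is scored on the shape of its reasoning trace rather than on the final answer.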
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
label-free
reasoning topology
open-ended tasks
reasoning trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structure-Aware Reinforcement Learning
Reasoning Topology
Label-Free RL
Small-World Network
Reasoning Map