SARL: Label-Free Reinforcement Learning by Rewarding Reasoning Topology

📅 2026-03-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses reward scarcity in open-domain reasoning, where ambiguous answers preclude reliable supervision signals for reinforcement learning. To overcome this limitation, the authors propose Structure-Aware Reinforcement Learning (SARL), which brings the small-world properties of complex networks into the reinforcement learning framework. SARL constructs a topological Reasoning Map of inference paths and uses its local clustering and globally efficient connectivity as an unsupervised reward signal, shifting the training paradigm from outcome-oriented to process-oriented. Implemented with PPO/GRPO on Qwen3-4B, SARL substantially outperforms existing baselines, achieving gains of 9.1%–11.6% on mathematical reasoning tasks and 30.4%–34.6% on open-domain tasks, while also exhibiting lower KL divergence and higher policy entropy.
📝 Abstract
Reinforcement learning has become central to improving large reasoning models, but its success still relies heavily on verifiable rewards or labeled supervision. This limits its applicability to open-ended domains where correctness is ambiguous and cannot be verified. Moreover, reasoning trajectories remain largely unconstrained, and optimization toward the final answer can favor early exploitation over generalization. In this work, we ask whether general reasoning ability can be improved by teaching models how to think (the structure of reasoning) rather than what to produce (the outcome of reasoning), and we extend traditional RLVR to open-ended settings. We introduce structure-aware reinforcement learning (SARL), a label-free framework that constructs a per-response Reasoning Map from intermediate thinking steps and rewards its small-world topology, inspired by complex networks and the functional organization of the human brain. SARL encourages reasoning trajectories that are both locally coherent and globally efficient, shifting supervision from destination to path. Our experiments on Qwen3-4B show SARL surpasses ground-truth-based RL and prior label-free RL baselines, achieving the best average gains of 9.1% under PPO and 11.6% under GRPO on math tasks, and 34.6% under PPO and 30.4% under GRPO on open-ended tasks. Beyond strong performance, SARL also exhibits lower KL divergence and higher policy entropy, indicating more stable, exploratory training and generalized reasoning ability.
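The abstract's core idea, rewarding local coherence plus global efficiency of a per-response Reasoning Map, can be sketched as a small-world score over a step graph. The sketch below is an illustration, not the paper's implementation: how steps are extracted and linked into the graph (`adj`), and the weights `alpha`/`beta`, are hypothetical assumptions; the two graph statistics (average local clustering and average shortest-path length) are the standard small-world ingredients the abstract names.

```python
from collections import deque
from itertools import combinations

def avg_clustering(adj):
    """Average local clustering coefficient: how densely each
    step's neighbors link to one another (local coherence)."""
    total = 0.0
    for v, nbrs in adj.items():
        k = len(nbrs)
        if k < 2:
            continue  # clustering undefined for < 2 neighbors; count as 0
        links = sum(1 for a, b in combinations(nbrs, 2) if b in adj[a])
        total += 2 * links / (k * (k - 1))
    return total / len(adj)

def avg_shortest_path(adj):
    """Mean shortest-path length via BFS from every node
    (global efficiency proxy); assumes a connected graph."""
    total = pairs = 0
    for src in adj:
        dist = {src: 0}
        q = deque([src])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(d for v, d in dist.items() if v != src)
        pairs += len(dist) - 1
    return total / pairs

def small_world_reward(adj, alpha=1.0, beta=1.0):
    """Hypothetical label-free reward: reward high clustering
    (locally coherent steps) and short path length (globally
    efficient connectivity). alpha/beta are assumed weights."""
    return alpha * avg_clustering(adj) + beta / avg_shortest_path(adj)

# Toy Reasoning Map: 5 reasoning steps; undirected adjacency sets.
toy_map = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1, 3},
    3: {1, 2, 4},
    4: {3},
}
reward = small_world_reward(toy_map)
```

In an actual RLVR-style loop this scalar would replace the verifiable-answer reward inside PPO/GRPO, so the policy is scored on the shape of its reasoning trace rather than on the final answer.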
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
label-free
reasoning topology
open-ended tasks
reasoning trajectories
Innovation

Methods, ideas, or system contributions that make the work stand out.

Structure-Aware Reinforcement Learning
Reasoning Topology
Label-Free RL
Small-World Network
Reasoning Map