RLFR: Extending Reinforcement Learning for LLMs with Flow Environment

πŸ“… 2025-10-11
πŸ“ˆ Citations: 0
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Existing RL-based verifiable reasoning (RLVR) methods rely on binary verification signals, often overlooking valuable exploratory steps in reasoning traces; gold-standard process reward models (PRMs) incur prohibitive annotation costs, while auxiliary signals (e.g., entropy or likelihood) yield limited reward shaping efficacy. This paper proposes RLFRβ€”a novel framework that, for the first time, models the policy’s latent state space as a continuous vector field environment. RLFR constructs this field using offline expert demonstrations and online rejection sampling, then defines a differentiable, fine-grained process reward signal based on the velocity deviation of latent states within the field. Crucially, RLFR eliminates the need for manually annotated PRMs and effectively captures context-dependent reasoning dynamics. Evaluated on both language and multimodal reasoning benchmarks, RLFR consistently improves model performance over strong baselines. These results validate latent-space vector field modeling as an effective and generalizable paradigm for process reward generation.

πŸ“ Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a promising framework for improving reasoning abilities in Large Language Models (LLMs). However, a policy optimized with binary verification is prone to overlooking potentially valuable exploration in the reasoning trajectory. In view of the heavy annotation cost of golden Process Reward Models (PRMs), recent works attempt to use auxiliary signals for reward shaping of process tokens, such as entropy and likelihood collected from the logit space. In this work, we offer a novel perspective on shaping RLVR with flow rewards derived from the latent space, and propose RLFR, in which flow fields of model latents are constructed from either off-policy high-quality data or on-policy rejection-sampling data, and the velocity deviations of policy latents within them are quantified to serve as a reward signal. RLFR first demonstrates that a well-established flow field can be a sound environment for reward-signal collection, highlighting that the expressive latent space remains largely underexplored. Moreover, RLFR can compress any off-policy expert data into a reference for constituting reward signals, and we show that it exploits the efficient context dependence compressed within the hidden states, rather than individual token-level denotations, for comprehending context. Experiments on both language and multimodal reasoning benchmarks demonstrate the reliability of flow rewards and suggest a promising paradigm for reward shaping with auxiliary signals.
Problem

Research questions and friction points this paper is trying to address.

Improving reasoning in LLMs with latent space flow rewards
Addressing binary verification limitations in RLVR framework
Utilizing off-policy data for efficient reward signal generation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Flow rewards derived from latent space
Velocity deviations of policy latents quantified
Compress off-policy expert data as reference
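The innovation bullets above (a flow field fit on reference latents, with velocity deviation quantified as a reward) can be illustrated with a toy flow-matching sketch. This is a hypothetical illustration under assumed names and dimensions, not the paper's implementation: the latent dimension, network size, and the `flow_reward` scoring rule are all assumptions made for the example.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

DIM = 16  # illustrative latent dimension (assumption, not from the paper)

class VelocityField(nn.Module):
    """Small MLP v_theta(x_t, t) trained with conditional flow matching."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim)
        )

    def forward(self, x, t):
        return self.net(torch.cat([x, t], dim=-1))

def fit_flow_field(ref_latents, steps=500, lr=1e-2):
    """Fit v_theta so that v_theta(x_t, t) ~ x1 - x0, where x1 are
    reference latents (standing in for off-policy expert data)."""
    field = VelocityField(ref_latents.shape[-1])
    opt = torch.optim.Adam(field.parameters(), lr=lr)
    for _ in range(steps):
        x1 = ref_latents[torch.randint(len(ref_latents), (64,))]
        x0 = torch.randn_like(x1)           # noise endpoint of the path
        t = torch.rand(len(x1), 1)
        xt = (1 - t) * x0 + t * x1          # linear interpolation path
        loss = ((field(xt, t) - (x1 - x0)) ** 2).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()
    return field

@torch.no_grad()
def flow_reward(field, latent, n_samples=64):
    """Score a policy latent by its velocity deviation inside the field:
    reward = negative expected squared deviation (higher = more in-field)."""
    x1 = latent.expand(n_samples, -1)
    x0 = torch.randn_like(x1)
    t = torch.rand(n_samples, 1)
    xt = (1 - t) * x0 + t * x1
    deviation = ((field(xt, t) - (x1 - x0)) ** 2).mean()
    return -deviation.item()

# Toy demo: reference latents cluster around a fixed mean.
ref = torch.randn(512, DIM) * 0.1 + 1.0
field = fit_flow_field(ref)
r_in = flow_reward(field, torch.ones(1, DIM))            # in-distribution latent
r_out = flow_reward(field, -3.0 * torch.ones(1, DIM))    # out-of-distribution latent
print(r_in, r_out)
```

In this sketch, a latent that looks like the reference data incurs a small velocity deviation (reward near zero), while an out-of-distribution latent is penalized, mirroring the idea of using the flow field as a reward environment without a hand-annotated PRM.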
πŸ”Ž Similar Papers
No similar papers found.
Jinghao Zhang
Kuaishou Tech
Recommender Systems, Multimedia, Large Language Model
Naishan Zheng
University of Science and Technology of China
Ruilin Li
Shanghai Innovation Institute
Dongzhou Cheng
Shanghai Innovation Institute
Zheming Liang
University of Science and Technology of China
Feng Zhao
University of Science and Technology of China
Jiaqi Wang
Shanghai Innovation Institute