floq: Training Critics via Flow-Matching for Scaling Compute in Value-Based RL

📅 2025-09-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
Conventional temporal-difference (TD) methods rely on monolithic Q-networks, limiting fine-grained control over the capacity and representational power of value functions. Method: This paper introduces flow matching—a continuous-time generative modeling technique—into value-based reinforcement learning for the first time. We propose an iterative Q-function parameterization framework that models Q-value evolution via a velocity field; Q-values are estimated progressively through multi-step numerical integration, and TD targets are derived from a learned target velocity field. Crucially, computational capacity is dynamically scalable by adjusting the number of integration steps, circumventing architectural rigidity inherent in fixed-structure networks. Contribution/Results: Our approach achieves an average 1.8× performance improvement over standard TD baselines across diverse offline RL benchmarks—including both discrete and continuous control tasks—as well as online fine-tuning scenarios, demonstrating substantial gains in expressivity and empirical effectiveness.
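The core mechanism described above — estimating a Q-value by numerically integrating a learned velocity field, where capacity scales with the number of integration steps — can be illustrated with a minimal sketch. The velocity field here is a toy closed-form function standing in for the neural network conditioned on state and action; `euler_integrate`, `q_star`, and `velocity` are illustrative names, not the paper's API.

```python
def euler_integrate(velocity, q0, num_steps):
    """Integrate dq/dt = velocity(q, t) from t=0 to t=1 with fixed-step Euler."""
    q, dt = q0, 1.0 / num_steps
    for k in range(num_steps):
        q = q + dt * velocity(q, k * dt)  # one iterative refinement of the Q estimate
    return q

# Toy stand-in velocity field: a straight-line (rectified) flow from q0 = 0
# to a target value q_star has the constant velocity q_star - q0. In floq,
# this would be a learned network taking (state, action) as conditioning.
q_star = 3.0
velocity = lambda q, t: q_star - 0.0

q_hat = euler_integrate(velocity, q0=0.0, num_steps=8)
```

Raising `num_steps` adds more iterative compute per Q-value evaluation without changing the network architecture, which is the sense in which capacity is dynamically scalable.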

📝 Abstract
A hallmark of modern large-scale machine learning techniques is the use of training objectives that provide dense supervision to intermediate computations, such as teacher forcing the next token in language models or denoising step-by-step in diffusion models. This enables models to learn complex functions in a generalizable manner. Motivated by this observation, we investigate the benefits of iterative computation for temporal difference (TD) methods in reinforcement learning (RL). TD methods typically represent value functions in a monolithic fashion, without iterative computation. We introduce floq (flow-matching Q-functions), an approach that parameterizes the Q-function using a velocity field and trains it using techniques from flow matching, typically used in generative modeling. The velocity field underlying the flow is trained with a TD-learning objective, which bootstraps from values produced by a target velocity field, computed by running multiple steps of numerical integration. Crucially, floq allows for more fine-grained control and scaling of the Q-function capacity than monolithic architectures, by appropriately setting the number of integration steps. Across a suite of challenging offline RL benchmarks and online fine-tuning tasks, floq improves performance by nearly 1.8x. floq scales capacity far better than standard TD-learning architectures, highlighting the potential of iterative computation for value learning.
Problem

Research questions and friction points this paper is trying to address.

Scaling value-based reinforcement learning with iterative computation
Improving Q-function capacity via flow-matching techniques
Enhancing TD-learning performance through velocity field parameterization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Parameterizes Q-function using flow-matching velocity field
Trains velocity field with TD-learning and numerical integration
Scales capacity by adjusting integration steps dynamically
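The second bullet — forming TD targets by bootstrapping from a target velocity field — can be sketched as follows. The target velocity field here is a hypothetical closed-form stand-in for the frozen target network; `flow_q`, `q_next`, and `target_velocity` are illustrative names under that assumption, not the paper's implementation.

```python
def flow_q(velocity, num_steps, q0=0.0):
    """Estimate a Q-value by Euler-integrating a velocity field from t=0 to t=1."""
    q, dt = q0, 1.0 / num_steps
    for k in range(num_steps):
        q = q + dt * velocity(q, k * dt)
    return q

# Hypothetical frozen target velocity field: a straight-line flow whose
# endpoint plays the role of Q_target(s', a') = 2.0 for the next state-action.
q_next = 2.0
target_velocity = lambda q, t: q_next

# Standard TD bootstrapping, with the bootstrap value produced by integrating
# the target velocity field rather than by a single monolithic forward pass.
reward, gamma = 1.0, 0.99
td_target = reward + gamma * flow_q(target_velocity, num_steps=4)
```

The online velocity field would then be regressed so that its own integrated Q estimate matches `td_target`, mirroring a standard TD update.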