Value-Based Deep RL Scales Predictably

📅 2025-02-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the critical challenge of unpredictable scalability in value-based off-policy deep reinforcement learning. We first uncover systematic scaling laws for value-based RL algorithms (including SAC, BRO, and PQL) under increasing data volume and computational resources. To formalize this behavior, we propose a Pareto-frontier modeling framework governed by the updates-to-data (UTD) ratio, establishing quantitative relationships among dataset size, compute budget, and policy performance. Furthermore, we design a budget-aware hyperparameter auto-configuration mechanism that jointly mitigates overfitting and plasticity loss. Extensive evaluation across DeepMind Control, OpenAI Gym, and IsaacGym demonstrates that our method enables high-fidelity extrapolation from small-scale experiments to larger-scale deployments, achieving significantly lower prediction error than existing baselines. Our results establish that value-based off-policy RL exhibits robust, analytically tractable, and modelable scaling behavior.

📝 Abstract
Scaling data and compute is critical to the success of machine learning. However, scaling demands predictability: we want methods to not only perform well with more compute or data, but also have their performance be predictable from small-scale runs, without running the large-scale experiment. In this paper, we show that value-based off-policy RL methods are predictable despite community lore regarding their pathological behavior. First, we show that data and compute requirements to attain a given performance level lie on a Pareto frontier, controlled by the updates-to-data (UTD) ratio. By estimating this frontier, we can predict this data requirement when given more compute, and this compute requirement when given more data. Second, we determine the optimal allocation of a total resource budget across data and compute for a given performance and use it to determine hyperparameters that maximize performance for a given budget. Third, this scaling behavior is enabled by first estimating predictable relationships between hyperparameters, which is used to manage effects of overfitting and plasticity loss unique to RL. We validate our approach using three algorithms: SAC, BRO, and PQL on DeepMind Control, OpenAI Gym, and IsaacGym, when extrapolating to higher levels of data, compute, budget, or performance.
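The abstract's central idea, that the (data, compute) pairs attaining a fixed performance level lie on a frontier that can be estimated from small-scale runs and then extrapolated, can be sketched as a simple log-log fit. The power-law form and all the numbers below are illustrative assumptions for exposition, not the paper's actual parameterization or measurements:

```python
import numpy as np

# Hypothetical small-scale measurements: (data, compute) pairs that all
# reach the same target return, traced out by varying the UTD ratio.
# These values are made up for illustration.
data = np.array([1e5, 2e5, 4e5, 8e5])           # environment steps
compute = np.array([3.2e9, 1.7e9, 9.0e8, 4.8e8])  # e.g., gradient-update FLOPs

# Assume a power-law frontier: compute ≈ a * data**b (b < 0, since more
# data trades off against less compute). Fit as a line in log-log space.
b, log_a = np.polyfit(np.log(data), np.log(compute), 1)
a = np.exp(log_a)

def compute_needed(data_budget):
    """Predict compute required to hit the target given a data budget."""
    return a * data_budget ** b

def data_needed(compute_budget):
    """Invert the frontier: data required given a compute budget."""
    return (compute_budget / a) ** (1.0 / b)
```

With the frontier fitted, extrapolation in either direction is a function evaluation, e.g. `compute_needed(1.6e6)` predicts the compute needed at twice the largest measured data budget. The paper's actual framework additionally allocates a joint budget across data and compute and configures hyperparameters accordingly; this sketch covers only the frontier-estimation step.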
Problem

Research questions and friction points this paper is trying to address.

Predictable scaling in value-based RL
Optimal resource allocation for RL
Hyperparameter relationship estimation in RL
Innovation

Methods, ideas, or system contributions that make the work stand out.

Value-based off-policy RL predictability
Pareto frontier for data-compute scaling
Optimal resource budget allocation