🤖 AI Summary
In settings where real-world interactions are costly and scarce, reinforcement learning (RL) depends heavily on high-fidelity data. To address this, the authors propose the Multi-Fidelity Policy Gradient (MFPG) framework, which jointly leverages heterogeneous low-fidelity simulators (e.g., reduced-order models or generative world models) and a small number of real-world interactions to produce unbiased, reduced-variance policy gradient estimates. MFPG combines control variates with REINFORCE and proximal policy optimization (PPO) to enable efficient sim-to-real transfer, and it remains robust even under substantial fidelity gaps between the low-fidelity simulators and the target environment. On simulated robotics benchmarks, MFPG achieves up to 3.9x higher reward under tight sample budgets and significantly improves training stability. Notably, it matches or surpasses the baselines even when they are granted up to ten times as many real-world interactions, demonstrating a substantial reduction in reliance on high-fidelity data.
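For reference, a multi-fidelity control-variate estimator generically takes the following form; the pairing of rollouts and the coefficient $c$ shown here are illustrative design choices, not necessarily the paper's exact construction:

$$
\hat{g}_{\mathrm{MF}} \;=\; \frac{1}{N}\sum_{i=1}^{N} g_{\mathrm{hi}}(\tau_i) \;-\; c\left(\frac{1}{N}\sum_{i=1}^{N} g_{\mathrm{lo}}(\tau_i) \;-\; \frac{1}{M}\sum_{j=1}^{M} g_{\mathrm{lo}}(\tilde\tau_j)\right), \qquad M \gg N,
$$

where $g_{\mathrm{hi}}$ is a policy gradient term computed from scarce target-environment rollouts $\tau_i$ and $g_{\mathrm{lo}}$ is the corresponding term from cheap low-fidelity rollouts. Because both parenthesized sums estimate the same expectation, the correction term has (approximately) zero mean, so $\hat{g}_{\mathrm{MF}}$ stays unbiased while its variance shrinks whenever the high- and low-fidelity terms are positively correlated.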
📝 Abstract
Many reinforcement learning (RL) algorithms require large amounts of data, prohibiting their use in applications where frequent interactions with operational systems are infeasible, or high-fidelity simulations are expensive or unavailable. Meanwhile, low-fidelity simulators--such as reduced-order models, heuristic reward functions, or generative world models--can cheaply provide useful data for RL training, even if they are too coarse for direct sim-to-real transfer. We propose multi-fidelity policy gradients (MFPGs), an RL framework that mixes a small amount of data from the target environment with a large volume of low-fidelity simulation data to form unbiased, reduced-variance estimators (control variates) for on-policy policy gradients. We instantiate the framework by developing multi-fidelity variants of two policy gradient algorithms: REINFORCE and proximal policy optimization. Experimental results across a suite of simulated robotics benchmark problems demonstrate that when target-environment samples are limited, MFPG achieves up to 3.9x higher reward and improves training stability compared to baselines that use only high-fidelity data. Moreover, even when the baselines are given more high-fidelity samples--up to 10x as many interactions with the target environment--MFPG continues to match or outperform them. Finally, we observe that MFPG is capable of training effective policies even when the low-fidelity environment is drastically different from the target environment. MFPG thus not only offers a novel paradigm for efficient sim-to-real transfer but also provides a principled approach to managing the trade-off between policy performance and data collection costs.
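The abstract does not spell out the estimator, but a minimal sketch of a multi-fidelity REINFORCE update built on the generic control-variate form above might look as follows. All function names, argument names, and the choice c = 1 are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def reinforce_term(score_grads, returns):
    """Average per-episode REINFORCE terms: mean_i [ R_i * sum_t grad log pi(a_t | s_t) ].

    score_grads : list of arrays, each the summed score-function gradient of one episode
    returns     : list of scalar episode returns
    """
    return np.mean([R * g for g, R in zip(score_grads, returns)], axis=0)

def mf_policy_gradient(hi_grads, hi_returns,
                       lo_grads_paired, lo_returns_paired,
                       lo_grads_large, lo_returns_large,
                       c=1.0):
    """Combine scarce high-fidelity rollouts with cheap low-fidelity rollouts
    via a control variate (illustrative sketch, not the paper's API).

    hi_*        : small batch from the target (high-fidelity) environment
    lo_*_paired : low-fidelity rollouts correlated with the high-fidelity batch
                  (e.g., same policy, shared random seeds)
    lo_*_large  : large independent low-fidelity batch whose average
                  approximates the expected low-fidelity gradient
    c           : control-variate coefficient; c = 1 is shown, but the
                  variance-optimal value depends on the covariance between
                  the high- and low-fidelity estimators
    """
    g_hi = reinforce_term(hi_grads, hi_returns)                      # unbiased, high variance
    g_lo_small = reinforce_term(lo_grads_paired, lo_returns_paired)  # correlated with g_hi
    g_lo_big = reinforce_term(lo_grads_large, lo_returns_large)      # cheap estimate of E[g_lo]
    # (g_lo_small - g_lo_big) has expectation ~0, so subtracting it keeps the
    # estimator unbiased while cancelling variance shared with g_hi.
    return g_hi - c * (g_lo_small - g_lo_big)
```

The key design point is that only the first term consumes target-environment samples; the correction term is built entirely from cheap low-fidelity rollouts, which is what lets the method trade simulator volume for real-world interaction budget.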