🤖 AI Summary
In sparse-reward environments, reinforcement learning suffers from inefficient exploration and poor sample efficiency due to infrequent and delayed feedback. To address this, we propose a value function initialization method that leverages a small number of (even suboptimal) successful demonstrations to estimate state-action values offline; these estimates serve as informative priors for online Q-learning, forming a lightweight “offline warm-start + online fine-tuning” paradigm. Our approach requires no architectural modifications or auxiliary modules, and it substantially reduces the early-stage exploration burden. On standard benchmark tasks, it converges significantly faster than baseline algorithms, and it is robust to variations in both the quantity and the quality of the demonstrations. These empirical results validate the effectiveness and practicality of value priors in sparse-reward settings: the method offers a simple yet powerful mechanism for improving learning efficiency without increasing model complexity.
📝 Abstract
Reinforcement learning (RL) in sparse-reward environments remains a significant challenge due to the lack of informative feedback. We propose a simple yet effective method that uses a small number of successful demonstrations to initialize the value function of an RL agent. By precomputing value estimates from offline demonstrations and using them as targets for early learning, our approach provides the agent with a useful prior over promising actions. The agent then refines these estimates through standard online interaction. This hybrid offline-to-online paradigm significantly reduces the exploration burden and improves sample efficiency in sparse-reward settings. Experiments on benchmark tasks demonstrate that our method accelerates convergence and outperforms standard baselines, even with minimal or suboptimal demonstration data.
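The abstract describes the method only at a high level. As a minimal illustration of the general idea (not the paper's exact algorithm), the sketch below warm-starts a tabular Q-table with the Monte Carlo returns of a single successful demonstration on a toy sparse-reward chain MDP, then refines it with standard ε-greedy Q-learning. The environment, function names, and hyperparameters are all illustrative assumptions.

```python
import numpy as np

# Toy sparse-reward chain MDP: states 0..N-1, actions {0: left, 1: right}.
# Reward 1.0 only on the transition into the terminal state N-1.
N_STATES, N_ACTIONS, GAMMA = 8, 2, 0.9

def step(s, a):
    s2 = min(s + 1, N_STATES - 1) if a == 1 else max(s - 1, 0)
    return s2, (1.0 if s2 == N_STATES - 1 else 0.0), s2 == N_STATES - 1

def collect_demo():
    """One successful (here: always-right) demonstration trajectory."""
    traj, s, done = [], 0, False
    while not done:
        s2, r, done = step(s, 1)
        traj.append((s, 1, r))
        s = s2
    return traj

def init_q_from_demo(demo, gamma=GAMMA):
    """Offline warm-start: set Q(s,a) along the demo to its Monte Carlo return.
    Only demonstrated (s, a) pairs get a prior; all others stay at zero."""
    Q = np.zeros((N_STATES, N_ACTIONS))
    G = 0.0
    for s, a, r in reversed(demo):
        G = r + gamma * G
        Q[s, a] = G
    return Q

def q_learning(Q, episodes=200, alpha=0.5, eps=0.1, max_steps=100, seed=0):
    """Online fine-tuning: standard ε-greedy tabular Q-learning."""
    rng = np.random.default_rng(seed)
    for _ in range(episodes):
        s, done, t = 0, False, 0
        while not done and t < max_steps:
            a = int(rng.integers(N_ACTIONS)) if rng.random() < eps else int(Q[s].argmax())
            s2, r, done = step(s, a)
            target = r if done else r + GAMMA * Q[s2].max()
            Q[s, a] += alpha * (target - Q[s, a])
            s, t = s2, t + 1
    return Q

Q_warm = q_learning(init_q_from_demo(collect_demo()))  # warm start
Q_cold = q_learning(np.zeros((N_STATES, N_ACTIONS)))   # zero-init baseline
```

In this toy setup the warm-started agent follows the demonstrated direction from the first episode and quickly refines the prior toward the true values, whereas the zero-initialized baseline must stumble onto the single rewarding transition by chance, which is exactly the exploration burden the abstract refers to.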