Actor-Critic with Active Importance Sampling

📅 2026-05-07
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work proposes Active Importance Sampling Actor-Critic (AISAC), an algorithm that targets the high variance of policy gradient estimates, a common cause of slow learning and unstable training. AISAC treats the behavior policy as a learnable component: using importance sampling, it adapts the behavior policy toward data-collection distributions aligned with the target policy gradient, reducing estimator variance while keeping the gradient estimate unbiased. For continuous control tasks, the behavior policy is a Gaussian fitted by cross-entropy minimization. In experiments on Inverted Pendulum and Half Cheetah, AISAC improves learning speed, sample efficiency, and training stability relative to standard Actor-Critic methods, and the improvements hold across the hyperparameter settings tested.
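The summary's central claim, that sampling from a different behavior policy can keep the gradient estimate unbiased while lowering its variance, can be illustrated with a minimal sketch. Everything here is an assumption made for illustration and is not taken from the paper: a one-dimensional target policy N(theta, 1), a reward R(a) = a (so the true gradient is 1), and the hypothetical helper names `normal_pdf` and `grad_samples`.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2)."""
    z = (x - mean) / std
    return np.exp(-0.5 * z * z) / (std * np.sqrt(2.0 * np.pi))

def grad_samples(theta, behavior_mean, behavior_std, n, rng):
    """Per-sample importance-weighted score-function gradient terms.

    Toy setup (an assumption, not from the paper): target policy N(theta, 1)
    and reward R(a) = a, so the true gradient d/dtheta E[R(a)] equals 1.
    """
    # Actions are drawn from the behavior policy, not the target policy.
    a = rng.normal(behavior_mean, behavior_std, size=n)
    # Importance weight pi(a) / b(a) corrects for the sampling mismatch.
    w = normal_pdf(a, theta, 1.0) / normal_pdf(a, behavior_mean, behavior_std)
    # Score of the target policy: d/dtheta log N(a; theta, 1) = a - theta.
    score = a - theta
    return w * a * score

rng = np.random.default_rng(0)
on_policy = grad_samples(0.0, 0.0, 1.0, 200_000, rng)   # behavior equals target
off_policy = grad_samples(0.0, 0.0, 1.5, 200_000, rng)  # wider behavior policy

print("on-policy  mean, var:", on_policy.mean(), on_policy.var())
print("off-policy mean, var:", off_policy.mean(), off_policy.var())
```

In this toy problem both sample means estimate the same gradient of 1, while the per-sample variance works out analytically to 2 on-policy and roughly 0.49 under the wider behavior policy. How AISAC itself chooses the behavior policy is described in the paper, not here.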
πŸ“ Abstract
This paper introduces the Active-Importance-Sampling Actor-Critic (AISAC) algorithm, an extension of the Actor-Critic framework for reducing variance in policy gradient estimation. AISAC optimizes the behavior policy to minimize gradient variance while preserving unbiased gradient estimates. Using importance sampling principles, the algorithm adapts the behavior policy toward efficient data collection distributions aligned with target policy gradients. For continuous action spaces, AISAC employs Gaussian behavior policies optimized through cross-entropy minimization. We provide theoretical analysis demonstrating variance reduction and unbiasedness. Experiments on Inverted Pendulum and Half Cheetah tasks show improved learning speed, sample efficiency, and training stability compared to standard Actor-Critic methods. Results indicate that optimizing the behavior policy improves both target policy updates and critic estimation accuracy across different hyperparameter settings. AISAC accelerates convergence and stabilizes reinforcement learning training, making it promising for real-world applications. Future work includes integration with advanced algorithms such as Soft Actor-Critic and TD3 for more complex environments.
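The abstract mentions Gaussian behavior policies optimized through cross-entropy minimization but gives no update rule. As a hedged sketch of one standard way to do this (the cross-entropy method from the importance sampling literature, which may differ from AISAC's actual procedure), a Gaussian can be fitted by weighted maximum likelihood to the density proportional to |g(a)| pi(a), where g is the per-sample gradient term. The toy setup (target policy N(theta, 1), reward R(a) = a) and the names `ce_gaussian_fit` and `fit_behavior` are assumptions for illustration.

```python
import numpy as np

def normal_pdf(x, mean, std):
    """Density of a univariate Gaussian N(mean, std^2)."""
    z = (x - mean) / std
    return np.exp(-0.5 * z * z) / (std * np.sqrt(2.0 * np.pi))

def ce_gaussian_fit(samples, weights):
    """Weighted maximum-likelihood Gaussian fit.

    Minimizing the cross-entropy from a weighted empirical distribution
    to a Gaussian has this closed form: weighted mean and weighted std.
    """
    w = weights / weights.sum()
    mean = np.sum(w * samples)
    std = np.sqrt(np.sum(w * (samples - mean) ** 2))
    return mean, std

def fit_behavior(theta, behavior_mean, behavior_std, n, rng):
    """One cross-entropy update of a Gaussian behavior policy.

    Toy setup (an assumption, not from the paper): target policy N(theta, 1),
    reward R(a) = a, gradient term g(a) = R(a) * (a - theta).
    """
    a = rng.normal(behavior_mean, behavior_std, size=n)
    ratio = normal_pdf(a, theta, 1.0) / normal_pdf(a, behavior_mean, behavior_std)
    g = a * (a - theta)
    # Weights make the samples represent the density proportional to |g| * pi.
    return ce_gaussian_fit(a, np.abs(g) * ratio)

rng = np.random.default_rng(0)
mean, std = fit_behavior(theta=0.0, behavior_mean=0.0, behavior_std=1.0,
                         n=200_000, rng=rng)
print("fitted behavior policy: mean =", mean, "std =", std)
```

For this toy problem the fitted Gaussian should come out centered near 0 and wider than the target policy (the exact cross-entropy minimizer has standard deviation sqrt(3)), which matches the intuition that samples far from the mean carry most of the gradient signal here.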
Problem

Research questions and friction points this paper is trying to address.

policy gradient variance
Actor-Critic
importance sampling
sample efficiency
training stability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Actor-Critic
Importance Sampling
Variance Reduction
Behavior Policy Optimization
Policy Gradient