Beyond Noisy-TVs: Noise-Robust Exploration Via Learning Progress Monitoring

📅 2025-09-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
Intrinsic reward-based exploration in reinforcement learning fails under non-learnable stochastic noise (e.g., the "noisy-TV" problem), as conventional uncertainty- or similarity-driven signals spuriously incentivize attention to irreducible noise. Method: We propose Learning Progress Monitoring (LPM), the first approach to use the temporal improvement in dynamics-model prediction error—i.e., model learning progress—as the intrinsic reward signal. LPM is theoretically proven to be monotonically positively correlated with information gain and inherently zero-equivariant, thus avoiding noise traps. It employs a dual-network architecture to estimate and reward only learnable state transitions, suppressing responses to noise. Results: On noisy benchmarks—including MNIST, 3D mazes, and Atari—LPM significantly improves sample efficiency and exploration breadth, accelerates convergence, and achieves superior downstream task performance compared to state-of-the-art methods based on uncertainty estimation or distributional similarity.

📝 Abstract
When there exists an unlearnable source of randomness (a "noisy-TV") in the environment, a naive intrinsic-reward-driven exploring agent gets stuck at that source of randomness and fails at exploration. Intrinsic rewards based on uncertainty estimation or distribution similarity, while they eventually escape noisy-TVs as time unfolds, suffer from poor sample efficiency and high computational cost. Inspired by recent findings from neuroscience that humans monitor their own improvement during exploration, we propose a novel method for intrinsically motivated exploration, named Learning Progress Monitoring (LPM). During exploration, LPM rewards model improvements instead of prediction error or novelty, effectively rewarding the agent for observing learnable transitions rather than unlearnable ones. We introduce a dual-network design that uses an error model to predict the expected prediction error of the dynamics model in its previous iteration, and uses the difference between the model errors of the current and previous iterations to guide exploration. We theoretically show that the intrinsic reward of LPM is zero-equivariant and a monotone indicator of Information Gain (IG), and that the error model is necessary to achieve monotone correspondence with IG. We empirically compared LPM against state-of-the-art baselines in noisy environments based on MNIST, a 3D maze with 160x120 RGB inputs, and Atari. Results show that LPM's intrinsic reward converges faster, that LPM explores more states in the maze experiment, and that it achieves higher extrinsic reward in Atari. This conceptually simple approach marks a paradigm shift in noise-robust exploration. For code to reproduce our experiments, see https://github.com/Akuna23Matata/LPM_exploration
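The dual-network design described above can be sketched as follows. This is a minimal illustration, not the authors' released implementation (see their repository for that): all class and variable names here are hypothetical, and the reward is assumed to be the error model's prediction of the previous-iteration dynamics error minus the current observed error, so that unlearnable (noisy-TV) transitions, whose error never shrinks, yield a reward near zero.

```python
import torch
import torch.nn as nn


class LPMRewardSketch:
    """Hypothetical sketch of Learning Progress Monitoring (LPM).

    dynamics:    predicts the next state from (state, action).
    error_model: predicts the expected prediction error of the
                 dynamics model at its *previous* iteration.
    Intrinsic reward = predicted previous error - current error,
    i.e. estimated learning progress. For irreducible noise both
    terms stay high and roughly equal, so the reward vanishes.
    """

    def __init__(self, state_dim, action_dim, hidden=64, lr=1e-3):
        inp = state_dim + action_dim
        self.dynamics = nn.Sequential(
            nn.Linear(inp, hidden), nn.ReLU(), nn.Linear(hidden, state_dim))
        self.error_model = nn.Sequential(
            nn.Linear(inp, hidden), nn.ReLU(), nn.Linear(hidden, 1))
        self.opt_dyn = torch.optim.Adam(self.dynamics.parameters(), lr=lr)
        self.opt_err = torch.optim.Adam(self.error_model.parameters(), lr=lr)

    def intrinsic_reward(self, s, a, s_next):
        x = torch.cat([s, a], dim=-1)
        with torch.no_grad():
            cur_err = ((self.dynamics(x) - s_next) ** 2).mean(-1, keepdim=True)
            prev_err = self.error_model(x)
        return (prev_err - cur_err).squeeze(-1)  # learning progress

    def update(self, s, a, s_next):
        x = torch.cat([s, a], dim=-1)
        # Fit the error model to the dynamics error *before* the
        # dynamics update, so it tracks the previous iteration's error.
        cur_err = ((self.dynamics(x) - s_next) ** 2).mean(-1, keepdim=True).detach()
        err_loss = ((self.error_model(x) - cur_err) ** 2).mean()
        self.opt_err.zero_grad()
        err_loss.backward()
        self.opt_err.step()
        # Then improve the dynamics model on the same transitions.
        dyn_loss = ((self.dynamics(x) - s_next) ** 2).mean()
        self.opt_dyn.zero_grad()
        dyn_loss.backward()
        self.opt_dyn.step()
```

Ordering the two updates this way is what lets a single error model stand in for the "previous iteration" without storing a second copy of the dynamics network.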
Problem

Research questions and friction points this paper is trying to address.

Addresses exploration failure in noisy environments with unlearnable randomness
Proposes learning progress monitoring to reward model improvements over novelty
Improves sample efficiency and reduces computational cost in noisy exploration tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Learning Progress Monitoring rewards model improvements
Dual-network design predicts dynamics model errors
Zero-equivariant intrinsic reward is a monotone indicator of Information Gain
Zhibo Hou
Department of Computer Science and Engineering, University of California, Merced
Zhiyu An
Department of Computer Science and Engineering, University of California, Merced
Wan Du
Department of Computer Science and Engineering, University of California, Merced