AI Summary
This work addresses the challenge of designing exploration rewards in reinforcement learning that account for both experience accumulated over the agent's lifetime and redundancy within an individual trajectory. Existing hybrid methods either combine the two signals through heuristic weights or rely on Gaussian-process dynamics that are limited to low-dimensional state spaces. The authors propose Conditional Information Gain (CIG), a scalable intrinsic reward that builds a log-determinant objective over an ensemble disagreement kernel and uses its Cholesky factorization to produce causal, per-step rewards conditioned on both the replay buffer and the trajectory prefix. Unlike those hybrids, CIG retains both conditioning sets without heuristic weighting or Gaussian-process assumptions, which makes it applicable to high-dimensional environments. Evaluated on twelve tasks from MiniGrid and OGBench, CIG matches or outperforms existing exploration methods in both clean and stochastic-distractor settings.
Abstract
Intrinsic rewards for exploration in reinforcement learning condition on different contexts: lifelong rewards score each transition against accumulated experience but ignore within-rollout redundancy; episodic rewards penalize intra-trajectory repetition but discard lifetime progress. Hybrid methods combine both signals through heuristic weights or require Gaussian-process dynamics that do not scale beyond low-dimensional state spaces. Trajectory-level information gain decomposes into per-step terms that condition on the replay buffer and rollout prefix simultaneously, but remains intractable for deep models. We derive the Conditional Information Gain (CIG) reward as a tractable surrogate: a log-determinant objective over an ensemble disagreement kernel whose Cholesky factorization yields causal per-step rewards that retain both conditioning sets while scaling to high-dimensional state spaces. We instantiate CIG in a model-based setting, where rollouts are short and within-rollout corrections remain largely unexplored. Across twelve tasks spanning discrete (MiniGrid) and continuous control (OGBench), in both clean and stochastic-distractor settings, CIG matches or outperforms prior exploration methods.
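The link between a log-determinant objective and causal per-step rewards can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes a particular kernel (inner products of ensemble-member deviations from the ensemble mean), a unit noise term `noise_var`, and that conditioning on the replay buffer enters only through the ensemble having been trained on it. The function name `cig_style_rewards` and the array layout are hypothetical. The identity it relies on is standard: for a positive-definite matrix A with Cholesky factor L, log det A = Σ_t 2 log L_tt, and L_tt depends only on the leading t×t block of A, that is, on steps up to t.

```python
import numpy as np


def cig_style_rewards(ensemble_preds, noise_var=1.0):
    """Per-step rewards from a log-det objective over a disagreement kernel.

    ensemble_preds: array of shape (M, T, D) holding the predictions of M
    ensemble members for T rollout steps with D output dimensions.
    (Hypothetical layout; the paper's exact kernel may differ.)
    """
    preds = np.asarray(ensemble_preds, dtype=float)
    n_members, n_steps, _ = preds.shape

    # Disagreement features: each member's deviation from the ensemble mean.
    dev = preds - preds.mean(axis=0, keepdims=True)
    feats = dev.transpose(1, 0, 2).reshape(n_steps, -1) / np.sqrt(n_members)

    # Kernel over rollout steps; identity plus PSD is positive definite.
    kernel = feats @ feats.T
    a = np.eye(n_steps) + kernel / noise_var

    # log det(a) = sum_t 2 * log(L[t, t]); term t uses only steps 0..t.
    chol = np.linalg.cholesky(a)
    return 2.0 * np.log(np.diag(chol))


# Toy rollout in which step 3 repeats the ensemble predictions of step 1.
rng = np.random.default_rng(0)
preds = rng.normal(size=(5, 6, 8))
preds[:, 3] = preds[:, 1]
rewards = cig_style_rewards(preds)
```

In this toy rollout the repeated step receives a smaller reward than its first occurrence, because the reward at step t reflects only the disagreement not already explained by the prefix. Truncating the rollout leaves the earlier rewards unchanged, which is the sense in which the decomposition is causal.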