R2R2: Robust Representation for Intensive Experience Reuse via Redundancy Reduction in Self-Predictive Learning

📅 2026-05-13
📈 Citations: 0
✨ Influential: 0

🤖 AI Summary
This work addresses instability and overfitting in Self-Predictive Learning (SPL) under data-scarce regimes such as real-world robotics, where intensive experience reuse (a high update-to-data ratio, UTD) induces overfitting at the representation level. The authors propose R2R2, a regularization method that reduces redundancy in the learned features. They show theoretically that standard zero-centering conflicts with the spectral properties of SPL, and design a non-centered objective accordingly. R2R2 is verified on SPL-native algorithms such as TD7 and on SimbaV2-SPL, a variant of SimbaV2 extended with a tailored SPL module because the original SimbaV2 lacks one. In experiments across 11 continuous control tasks, R2R2 at UTD=20 improves TD7 by approximately 22% and provides additional gains on top of SimbaV2-SPL, which itself establishes a new state of the art, indicating that R2R2 is orthogonal to prior advances.
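As a rough illustration of what a high UTD ratio means in practice, the sketch below counts gradient updates per collected transition. It is not taken from the paper or its repository: the function `count_updates`, the placeholder transitions, and the buffer and batch sizes are all hypothetical, and the gradient step itself is left as a comment.

```python
import random
from collections import deque


def count_updates(num_env_steps, utd_ratio, batch_size=4, seed=0):
    """Count gradient updates performed for a given update-to-data ratio.

    Hypothetical helper for illustration only; it performs no learning.
    """
    rng = random.Random(seed)
    buffer = deque(maxlen=10_000)  # assumed replay buffer capacity
    num_updates = 0
    for step in range(num_env_steps):
        # Placeholder transition; a real agent would store (s, a, r, s').
        buffer.append(step)
        if len(buffer) < batch_size:
            continue  # wait until one full batch can be sampled
        for _ in range(utd_ratio):
            batch = rng.sample(list(buffer), batch_size)
            # A real agent would take one gradient step on `batch` here.
            num_updates += 1
    return num_updates
```

With the same number of environment steps, `count_updates(100, 20)` performs twenty times as many updates as `count_updates(100, 1)`, so each stored transition is revisited far more often. This is the reuse regime in which the summary above reports overfitting.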
📝 Abstract
For reinforcement learning in data-scarce domains like real-world robotics, intensive data reuse enhances efficiency but induces overfitting. While prior works focus on critic bias, representation-level instability in Self-Predictive Learning (SPL) under high Update-to-Data (UTD) regimes remains underexplored. To bridge this gap, we propose Robust Representation via Redundancy Reduction (R2R2), a regularization method within SPL. We theoretically identify that standard zero-centering conflicts with SPL's spectral properties and design a non-centered objective accordingly. We verify R2R2 on SPL-native algorithms like TD7. Furthermore, to demonstrate its orthogonality to prior advancements, we extend the state-of-the-art SimbaV2, which originally lacks SPL, by integrating a tailored SPL module, termed SimbaV2-SPL. Experiments across 11 continuous control tasks confirm that R2R2 effectively mitigates overfitting; specifically, at a UTD ratio of 20, it improves TD7 by ~22% and provides additional gains on top of SimbaV2-SPL, which itself establishes a new state-of-the-art. The code can be found at: https://github.com/songsang7/R2R2
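The abstract does not state the R2R2 objective, so the following is only a generic redundancy-reduction sketch in the style of cross-correlation penalties, not the authors' loss. The function name `redundancy_penalty`, the `center` flag, and the unit root-mean-square scaling are assumptions made for illustration. The flag switches between a zero-centered and a non-centered variant, which is the distinction the abstract draws.

```python
import numpy as np


def redundancy_penalty(pred, target, center=False, eps=1e-8):
    """Sum of squared off-diagonal entries of a feature cross-correlation matrix.

    pred, target: arrays of shape (batch, dim), e.g. predicted and target
    latent states. Hypothetical illustration, not the R2R2 objective.
    """
    pred = np.asarray(pred, dtype=np.float64)
    target = np.asarray(target, dtype=np.float64)
    if center:
        # Zero-centered variant: subtract the per-feature batch mean.
        pred = pred - pred.mean(axis=0, keepdims=True)
        target = target - target.mean(axis=0, keepdims=True)
    # Scale each feature to unit root-mean-square over the batch.
    pred = pred / (np.sqrt((pred ** 2).mean(axis=0, keepdims=True)) + eps)
    target = target / (np.sqrt((target ** 2).mean(axis=0, keepdims=True)) + eps)
    corr = pred.T @ target / pred.shape[0]  # shape (dim, dim)
    off_diag = corr - np.diag(np.diag(corr))
    return float((off_diag ** 2).sum())


rng = np.random.default_rng(0)
independent = rng.standard_normal((512, 4))          # roughly decorrelated features
redundant = np.repeat(independent[:, :1], 4, axis=1)  # one feature copied 4 times
low = redundancy_penalty(independent, independent)
high = redundancy_penalty(redundant, redundant)
```

Here `high` is much larger than `low`, because copied features produce large off-diagonal correlations. Adding such a penalty to a self-predictive loss would discourage redundant feature dimensions; how R2R2 weights and formulates its own term is described in the paper and the linked repository.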
Problem

Research questions and friction points this paper is trying to address.

overfitting
Self-Predictive Learning
representation instability
data reuse
Update-to-Data ratio
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-Predictive Learning
Redundancy Reduction
Representation Regularization
High UTD
Overfitting Mitigation