Temporal Difference Learning with Constrained Initial Representations

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the training instability and poor sample efficiency of off-policy reinforcement learning caused by distributional shift in initial representations. To this end, the paper proposes the CIR framework, which for the first time systematically introduces an initial-representation constraint mechanism into temporal difference learning. The approach applies a Tanh activation at the input layer to bound representations, and combines it with feature normalization, skip connections, and convex Q-learning to build a stable and efficient training architecture. Theoretical analysis establishes the convergence of the proposed method, and empirical evaluations show that CIR matches or surpasses strong state-of-the-art baselines across multiple continuous control tasks, markedly improving both training stability and sample efficiency.
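The three architectural pieces named above (Tanh-bounded input, feature normalization, skip connection) can be sketched in a few lines. This is a hedged illustration of the general idea, not the paper's exact network: the function name `cir_encoder`, the layer widths, and the layer-norm-style normalization are all assumptions.

```python
import numpy as np

def cir_encoder(x, W_in, W_hidden, eps=1e-6):
    """Illustrative sketch of a CIR-style encoder (assumed architecture):
    (i) Tanh at the input layer bounds the initial representation to (-1, 1);
    (ii) feature normalization stabilizes its scale;
    (iii) a skip connection gives a linear path from the shallow layer
    to the deep layer. Shapes and details here are not from the paper."""
    # (i) bounded initial representation
    h0 = np.tanh(x @ W_in)
    # (ii) layer-norm-style feature normalization (no learned affine)
    h0 = (h0 - h0.mean(axis=-1, keepdims=True)) / (h0.std(axis=-1, keepdims=True) + eps)
    # (iii) deep feature plus a linear skip back to the shallow layer
    h1 = np.maximum(0.0, h0 @ W_hidden)  # ReLU hidden layer
    return h1 + h0  # skip connection (dimensions must match)

# Toy usage with random weights
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))          # batch of 4 raw observations
W_in = rng.normal(size=(8, 16)) * 0.1
W_h = rng.normal(size=(16, 16)) * 0.1
z = cir_encoder(x, W_in, W_h)
```

The Tanh bound keeps the shallow representation in a fixed range regardless of the raw input's scale, which is the intuition behind constraining initial representations against distribution shift.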

📝 Abstract
Recently, there have been numerous attempts to enhance the sample efficiency of off-policy reinforcement learning (RL) agents interacting with the environment, including architectural improvements and new algorithms. Despite these advances, they overlook the potential of directly constraining the initial representations of the input data, which can intuitively alleviate the distribution-shift issue and stabilize training. In this paper, we introduce the Tanh function into the initial layer to impose such a constraint. We theoretically analyze the convergence of temporal difference learning with the Tanh function under linear function approximation. Motivated by these theoretical insights, we present our Constrained Initial Representations framework, dubbed CIR, which consists of three components: (i) the Tanh activation together with normalization methods to stabilize representations; (ii) a skip-connection module that provides a linear pathway from the shallow layer to the deep layer; and (iii) convex Q-learning, which allows a more flexible value estimate and mitigates potential conservatism. Empirical results show that CIR performs strongly on numerous continuous control tasks, matching or even surpassing existing strong baseline methods.
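The abstract's theoretical setting, temporal difference learning under linear function approximation with Tanh-transformed features, can be illustrated with a standard TD(0) update. This is a generic sketch of that setting under stated assumptions: the function name `td0_tanh`, the random toy transitions, and the step size are illustrative, not taken from the paper.

```python
import numpy as np

def td0_tanh(transitions, dim, alpha=0.1, gamma=0.9):
    """TD(0) value estimation with linear function approximation,
    where Tanh is applied to raw observations so features are bounded
    in (-1, 1). Generic textbook update; details are assumptions."""
    w = np.zeros(dim)
    for s, r, s_next in transitions:
        phi, phi_next = np.tanh(s), np.tanh(s_next)   # bounded features
        td_error = r + gamma * phi_next @ w - phi @ w  # TD(0) error
        w += alpha * td_error * phi                    # semi-gradient step
    return w

# Toy usage: random transitions over 3-dimensional raw states
rng = np.random.default_rng(1)
data = [(rng.normal(size=3), rng.normal(), rng.normal(size=3))
        for _ in range(200)]
w = td0_tanh(data, dim=3)
```

Because `np.tanh` bounds every feature to (-1, 1), the semi-gradient step size is controlled by the input transform itself, which is the flavor of stability argument the abstract's convergence analysis concerns.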
Problem

Research questions and friction points this paper is trying to address.

off-policy reinforcement learning
distribution shift
initial representations
temporal difference learning
sample efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

Constrained Initial Representations
Tanh activation
Temporal Difference Learning
Convex Q-learning
Skip Connection