Chunking the Critic: A Transformer-based Soft Actor-Critic with N-Step Returns

📅 2025-03-05

📈 Citations: 0

✨ Influential: 0

career value

201K/year

🤖 AI Summary

To address the high variance and training instability induced by *n*-step returns in Soft Actor-Critic (SAC), and the coarse value estimation arising from action chunking applied solely to the actor, this paper proposes **Chunked-TSAC**: a novel paradigm that explicitly encodes and feeds action chunks into the critic network. Crucially, it is the first to integrate Transformer-based sequence modeling with *n*-step temporal-difference returns—enabling low-variance, highly robust value learning without importance sampling. The core innovation lies in migrating action chunking from the actor to the critic, empowering the critic to jointly model intra-chunk temporal dependencies and inter-chunk long-horizon returns. Experiments demonstrate that Chunked-TSAC significantly outperforms standard SAC and leading variants on sparse-reward and multi-stage tasks, achieving faster convergence, enhanced training stability, and superior final policy performance.

Technology Category

Application Category

📝 Abstract

Soft Actor-Critic (SAC) critically depends on its critic network, which typically evaluates a single state-action pair to guide policy updates. Using N-step returns is a common practice to reduce the bias in the target values of the critic. However, using N-step returns can again introduce high variance and necessitates importance sampling, often destabilizing training. Recent algorithms have also explored action chunking-such as direct action repetition and movement primitives-to enhance exploration. In this paper, we propose a Transformer-based Critic Network for SAC that integrates the N-returns framework in a stable and efficient manner. Unlike approaches that perform chunking in the actor network, we feed chunked actions into the critic network to explore potential performance gains. Our architecture leverages the Transformer's ability to process sequential information, facilitating more robust value estimation. Empirical results show that this method not only achieves efficient, stable training but also excels in sparse reward/multi-phase environments-traditionally a challenge for step-based methods. These findings underscore the promise of combining Transformer-based critics with N-returns to advance reinforcement learning performance

Problem

Research questions and friction points this paper is trying to address.

Reduces bias and variance in critic network evaluations.

Integrates N-step returns for stable and efficient training.

Enhances performance in sparse reward and multi-phase environments.

Innovation

Methods, ideas, or system contributions that make the work stand out.

Transformer-based Critic Network for SAC

Integrates N-step returns stably and efficiently

Feeds chunked actions into critic network

🔎 Similar Papers

Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL