Emergent temporal abstractions in autoregressive models enable hierarchical reinforcement learning

📅 2025-12-23
📈 Citations: 0
Influential: 0
🤖 AI Summary
Reinforcement learning (RL) with autoregressive models explores inefficiently under sparse rewards because decisions are made token by token. Method: We propose *Internal RL*, a paradigm that trains a non-causal, higher-order sequence model within the latent space of an autoregressive model. Through residual stream interventions, it learns implicit controllers endowed with temporal abstraction and learned termination conditions, yielding hierarchical action representations without an external architectural hierarchy. Contribution/Results: Internal RL enables foundation models to develop semantically coherent high-level policies that execute over long horizons. Evaluated on grid-world and MuJoCo hierarchical tasks, it significantly outperforms standard RL fine-tuning, which fails entirely under sparse rewards, while maintaining stable convergence. These results demonstrate the effectiveness and generality of latent action generation and in-representation-space RL for hierarchical decision-making.

📝 Abstract
Large-scale autoregressive models pretrained on next-token prediction and finetuned with reinforcement learning (RL) have achieved unprecedented success on many problem domains. During RL, these models explore by generating new outputs, one token at a time. However, sampling actions token-by-token can result in highly inefficient learning, particularly when rewards are sparse. Here, we show that it is possible to overcome this problem by acting and exploring within the internal representations of an autoregressive model. Specifically, to discover temporally-abstract actions, we introduce a higher-order, non-causal sequence model whose outputs control the residual stream activations of a base autoregressive model. On grid world and MuJoCo-based tasks with hierarchical structure, we find that the higher-order model learns to compress long activation sequence chunks onto internal controllers. Critically, each controller executes a sequence of behaviorally meaningful actions that unfold over long timescales and are accompanied with a learned termination condition, such that composing multiple controllers over time leads to efficient exploration on novel tasks. We show that direct internal controller reinforcement, a process we term "internal RL", enables learning from sparse rewards in cases where standard RL finetuning fails. Our results demonstrate the benefits of latent action generation and reinforcement in autoregressive models, suggesting internal RL as a promising avenue for realizing hierarchical RL within foundation models.
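The abstract's core mechanism, controllers that intervene on a frozen base model's residual stream and each carry a learned termination condition, can be illustrated with a minimal sketch. Everything below is hypothetical: the dimensions, the toy recurrent "base model", and the randomly initialized controller parameters stand in for the paper's trained components and are not its actual implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions; the paper does not specify these.
HIDDEN, N_CONTROLLERS = 8, 3

# Frozen "base model": a toy recurrent update standing in for an
# autoregressive transformer's residual stream dynamics.
W_base = rng.normal(scale=0.3, size=(HIDDEN, HIDDEN))

def base_step(h):
    """One autoregressive step: update the residual-stream state h."""
    return np.tanh(W_base @ h)

# Higher-order controller parameters (illustrative, randomly initialized):
# controller k adds an intervention u[k] to the residual stream and has a
# logistic termination condition parameterized by (w_term[k], b_term[k]).
u = rng.normal(scale=0.5, size=(N_CONTROLLERS, HIDDEN))
w_term = rng.normal(size=(N_CONTROLLERS, HIDDEN))
b_term = np.full(N_CONTROLLERS, -2.0)  # bias toward longer option execution

def run_controller(k, h, max_steps=20):
    """Execute controller k: intervene on the residual stream each step
    until its termination condition fires (or max_steps is reached)."""
    steps = 0
    for _ in range(max_steps):
        h = base_step(h + u[k])  # residual stream intervention
        steps += 1
        p_stop = 1.0 / (1.0 + np.exp(-(w_term[k] @ h + b_term[k])))
        if p_stop > 0.5:  # learned termination condition fires
            break
    return h, steps

# Compose controllers over time: three high-level decisions cover many
# low-level autoregressive steps, giving temporal abstraction.
h = np.zeros(HIDDEN)
total = 0
for k in [0, 2, 1]:
    h, n = run_controller(k, h)
    total += n  # total low-level steps spanned by three high-level choices
```

The key property this sketch shows is that exploration happens over a handful of controller choices rather than over every token, which is what makes sparse rewards reachable.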
Problem

Research questions and friction points this paper is trying to address.

Enable hierarchical reinforcement learning via temporal abstractions
Overcome inefficient token-by-token exploration in sparse reward tasks
Learn latent controllers for long-term action sequences from internal representations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Higher-order, non-causal model steers the base model's residual stream activations
Internal controllers compress long activation sequences into reusable, temporally-abstract actions
Internal RL reinforces controllers directly, enabling sparse-reward learning where standard RL fine-tuning fails