Heuristic Transformer: Belief Augmented In-Context Reinforcement Learning

📅 2025-11-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited decision-making capability of Transformer-based models in context-based reinforcement learning under a zero-parameter-update constraint. The authors propose the Heuristic Transformer, whose core innovation is to explicitly model a belief distribution over rewards as a low-dimensional latent random variable. Specifically, a variational auto-encoder (VAE) infers the posterior distribution over rewards from historical trajectories, and this structured belief representation is injected as a prompt into the Transformer's decoding process, enabling belief-augmented in-context learning. Crucially, the method requires no parameter fine-tuning; policy inference relies solely on contextual demonstrations, the current state, and the learned belief prompt. Evaluated across the Darkroom, Miniworld, and MuJoCo environments, the approach consistently outperforms existing context-based RL baselines, demonstrating superior task adaptability and cross-environment generalization.
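The belief-inference step described above can be sketched as follows. This is a minimal, hypothetical illustration of the VAE encoder's forward pass, not the paper's implementation: the pooling scheme, dimensions, and randomly initialized toy weights are all assumptions made here for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_reward_belief(states, actions, rewards, latent_dim=8, hidden=64):
    """Hypothetical sketch of a VAE encoder: map a trajectory of
    (state, action, reward) tuples to a Gaussian belief over rewards.
    Weights are random toy values; a real encoder would be trained."""
    x = np.concatenate([states, actions, rewards], axis=-1)      # (T, D)
    W1 = rng.standard_normal((x.shape[-1], hidden)) * 0.1
    h = np.maximum(x @ W1, 0).mean(axis=0)     # ReLU, then pool over time steps
    W_mu = rng.standard_normal((hidden, latent_dim)) * 0.1
    W_lv = rng.standard_normal((hidden, latent_dim)) * 0.1
    mu, logvar = h @ W_mu, h @ W_lv            # Gaussian posterior parameters
    # Reparameterisation trick: sample the low-dimensional belief latent z
    z = mu + np.exp(0.5 * logvar) * rng.standard_normal(latent_dim)
    return z, mu, logvar

# A toy trajectory: 10 steps of 4-dim states, 2-dim actions, scalar rewards
z, mu, logvar = encode_reward_belief(
    rng.standard_normal((10, 4)),
    rng.standard_normal((10, 2)),
    rng.standard_normal((10, 1)))
```

The key point the sketch captures is that the entire reward history is compressed into one low-dimensional stochastic variable `z`, which is what gets handed to the transformer as part of its prompt.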

📝 Abstract
Transformers have demonstrated exceptional in-context learning (ICL) capabilities, enabling applications across natural language processing, computer vision, and sequential decision-making. In reinforcement learning, ICL reframes learning as a supervised problem, facilitating task adaptation without parameter updates. Building on prior work leveraging transformers for sequential decision-making, we propose Heuristic Transformer (HT), an in-context reinforcement learning (ICRL) approach that augments the in-context dataset with a belief distribution over rewards to achieve better decision-making. Using a variational auto-encoder (VAE), a low-dimensional stochastic variable is learned to represent the posterior distribution over rewards, which is incorporated alongside an in-context dataset and query states as prompt to the transformer policy. We assess the performance of HT across the Darkroom, Miniworld, and MuJoCo environments, showing that it consistently surpasses comparable baselines in terms of both effectiveness and generalization. Our method presents a promising direction to bridge the gap between belief-based augmentations and transformer-based decision-making.
Problem

Research questions and friction points this paper is trying to address.

Enhancing in-context reinforcement learning with belief distributions over rewards
Incorporating reward posterior distributions via variational auto-encoders in transformers
Improving decision-making effectiveness and generalization across multiple environments
Innovation

Methods, ideas, or system contributions that make the work stand out.

Augments in-context dataset with belief distribution over rewards
Learns low-dimensional stochastic variable using variational auto-encoder
Incorporates belief distribution and query states into transformer policy
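The prompt construction listed in the innovation points can be sketched in a few lines. The ordering, shapes, and function name below are assumptions for illustration, not the paper's exact interface; the point is that the belief latent, the in-context demonstrations, and the query state are concatenated into one token sequence, and policy inference then needs no parameter updates.

```python
import numpy as np

def build_prompt(belief_z, context_tokens, query_state):
    """Hypothetical sketch: assemble the transformer-policy prompt by
    prepending the reward-belief latent to the in-context dataset tokens
    and appending the query state. All inputs share embedding dim d."""
    return np.vstack([
        belief_z[None, :],       # (1, d) belief over rewards from the VAE
        context_tokens,          # (T, d) in-context demonstrations
        query_state[None, :],    # (1, d) state to act on
    ])                           # -> (T + 2, d) prompt sequence

# Toy shapes: embedding dim 16, five context tokens
prompt = build_prompt(np.zeros(16), np.zeros((5, 16)), np.zeros(16))
```

Keeping the belief as a single prepended token is one plausible design choice; because only the prompt changes per task, the frozen transformer can adapt without any fine-tuning.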