TQL: Scaling Q-Functions with Transformers by Preventing Attention Collapse

📅 2026-02-01
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge that Transformer-based Q-functions in reinforcement learning fail to scale effectively with model size, often suffering from training instability and performance degradation due to attention score collapse. The study identifies, for the first time, that this issue stems from a sharp decline in attention entropy as model capacity increases. To mitigate this, the authors propose an attention entropy regularization mechanism that explicitly controls the attention distribution, thereby preventing collapse. This approach substantially enhances both the training stability and performance of large-scale Transformer Q-functions, achieving up to a 43% performance gain when scaling from the smallest to the largest model configuration, whereas existing methods exhibit noticeable degradation under the same conditions.

📝 Abstract
Despite scale driving substantial recent advancements in machine learning, reinforcement learning (RL) methods still primarily use small value functions. Naively scaling value functions -- including with a transformer architecture, which is known to be highly scalable -- often results in learning instability and worse performance. In this work, we ask: what prevents transformers from scaling effectively for value functions? Through empirical analysis, we identify the critical failure mode in this scaling: attention scores collapse as capacity increases. Our key insight is that we can effectively prevent this collapse and stabilize training by controlling the entropy of the attention scores, thereby enabling the use of larger models. To this end, we propose Transformer Q-Learning (TQL), a method that unlocks the scaling potential of transformers in learning value functions in RL. Our approach yields up to a 43% improvement in performance when scaling from the smallest to the largest network sizes, while prior methods suffer from performance degradation.
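The core idea described above -- keeping the entropy of the attention distribution from collapsing -- can be sketched in a few lines. This is a minimal NumPy illustration under assumptions of our own (the function names, the additive entropy-bonus form, and the `coef` weight are hypothetical), not the paper's actual TQL implementation:

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the given axis."""
    z = logits - logits.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def attention_entropy(attn_logits):
    """Mean Shannon entropy of the attention distributions.

    attn_logits: array of shape (..., queries, keys). Collapsed attention
    (nearly one-hot rows) yields entropy near 0; uniform attention over
    n keys yields log(n).
    """
    p = softmax(attn_logits)
    return float(-(p * np.log(p + 1e-12)).sum(axis=-1).mean())

def regularized_loss(td_loss, attn_logits, coef=0.01):
    """Hypothetical regularized objective: the usual TD loss minus an
    entropy bonus on the attention scores, so low-entropy (collapsed)
    attention is penalized during training."""
    return td_loss - coef * attention_entropy(attn_logits)
```

A uniform attention pattern over 4 keys has entropy `log(4) ≈ 1.386`, while a collapsed (one-hot) pattern has entropy near 0, so the regularized loss favors the former; in practice such a penalty would be applied to the attention logits of each transformer layer inside the Q-function's training loss.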
Problem

Research questions and friction points this paper is trying to address.

reinforcement learning
value function scaling
attention collapse
transformer
Q-function
Innovation

Methods, ideas, or system contributions that make the work stand out.

attention collapse
entropy regularization
Transformer Q-Learning
value function scaling
reinforcement learning
🔎 Similar Papers
2024-08-10 · AAAI Conference on Artificial Intelligence · Citations: 30