QUEST: A robust attention formulation using query-modulated spherical attention

📅 2026-03-31
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses training instability in standard Transformer attention, which arises from the unconstrained norms of query and key vectors, particularly in the presence of easy-to-learn spurious patterns in the data. To mitigate this, the authors propose QUEST (QUEry-modulated Spherical aTtention), a plug-and-play replacement for standard attention that constrains key vectors to lie on a hypersphere in latent space and introduces a query-modulated mechanism to dynamically control the sharpness of the attention distribution. QUEST thereby preserves expressive capacity while improving training stability and model robustness. Empirical results demonstrate its effectiveness in alleviating performance degradation across vision tasks and improving resilience to data corruption and adversarial attacks.
📝 Abstract
The Transformer model architecture has become one of the most widely used in deep learning and the attention mechanism is at its core. The standard attention formulation uses a softmax operation applied to a scaled dot product between query and key vectors. We explore the role played by norms of the queries and keys, which can cause training instabilities when they arbitrarily increase. We demonstrate how this can happen even in simple Transformer models, in the presence of easy-to-learn spurious patterns in the data. We propose a new attention formulation, QUEry-modulated Spherical aTtention (QUEST), that constrains the keys to a hyperspherical latent space, while still allowing individual tokens to flexibly control the sharpness of the attention distribution. QUEST can be easily used as a drop-in replacement for standard attention. We focus on vision applications while also exploring other domains to highlight the method's generality. We show that (1) QUEST trains without instabilities and (2) produces models with improved performance (3) that are robust to data corruptions and adversarial attacks.
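The two mechanisms the abstract describes, keys constrained to a hypersphere and a per-query control over attention sharpness, can be illustrated with a minimal sketch. This is an illustrative reconstruction in plain Python, not the paper's exact formulation: the helper `sharpness_fn` (a stand-in for whatever learned mapping the authors use from a query to a temperature) and all function names are assumptions.

```python
import math

def l2_normalize(v):
    """Project a vector onto the unit hypersphere."""
    n = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / n for x in v]

def quest_attention(query, keys, values, sharpness_fn):
    """Illustrative spherical attention for a single query.

    Keys are L2-normalized (hyperspherical constraint), so the dot
    product with the query is bounded; a query-dependent scale tau
    then modulates how sharp the softmax distribution is.
    """
    keys = [l2_normalize(k) for k in keys]          # keys on the unit sphere
    tau = sharpness_fn(query)                       # query-modulated sharpness
    scores = [tau * sum(q * k for q, k in zip(query, kv)) for kv in keys]
    m = max(scores)                                 # stable softmax
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    weights = [e / z for e in exps]
    dim = len(values[0])
    return [sum(w * v[d] for w, v in zip(weights, values)) for d in range(dim)]
```

Because the normalized key scores are bounded, attention can only become arbitrarily sharp through the explicit `tau` term, rather than through unbounded key norms, which is the source of instability the paper identifies. Increasing `tau` for a fixed query visibly concentrates the attention weights on the best-matching key.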
Problem

Research questions and friction points this paper is trying to address.

attention mechanism, training instability, query-key norms, spurious patterns, Transformer
Innovation

Methods, ideas, or system contributions that make the work stand out.

spherical attention, query-modulated, training stability, robustness, Transformer