🤖 AI Summary
To address information forgetting in symmetric power (sympow) linear transformers, caused by the limited capacity of their recurrent state in long-sequence modeling, this paper proposes the conformal-sympow architecture. Methodologically, it introduces (1) a data-dependent multiplicative gating mechanism that dynamically frees up state capacity, and (2) data-dependent rotary embeddings that enable selective, position-aware information storage. The design preserves the key efficiency features of sympow transformers, including linear-cost attention and symmetric tensor embeddings. In preliminary experiments on the LongCrawl64 dataset, conformal-sympow maintains robust performance as training and evaluation context lengths are scaled, overcoming the degradation observed in sympow transformers in these settings.
📝 Abstract
Transformers with linear attention offer significant computational advantages over softmax-based transformers but often suffer from degraded performance. The symmetric power (sympow) transformer, a particular type of linear transformer, addresses some of this performance gap by leveraging symmetric tensor embeddings, achieving comparable performance to softmax transformers. However, the finite capacity of the recurrent state in sympow transformers limits their ability to retain information, leading to performance degradation when scaling the training or evaluation context length. To address this issue, we propose the conformal-sympow transformer, which dynamically frees up capacity using data-dependent multiplicative gating and adaptively stores information using data-dependent rotary embeddings. Preliminary experiments on the LongCrawl64 dataset demonstrate that conformal-sympow overcomes the limitations of sympow transformers, achieving robust performance across scaled training and evaluation contexts.
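The two proposed mechanisms can be illustrated with a minimal single-head sketch: a causal linear-attention recurrence whose state is decayed by a data-dependent scalar gate and whose keys and queries pass through rotary embeddings with data-dependent, cumulative per-token angles. This is an illustrative reconstruction, not the paper's implementation; the shapes, the sigmoid gate, and the projection weights `gate_w` and `angle_w` are assumptions, and the symmetric-power embedding itself is omitted for brevity.

```python
import numpy as np

def rope(x, angles):
    """Rotary embedding: rotate consecutive feature pairs of x by the given angles."""
    x1, x2 = x[0::2], x[1::2]
    c, s = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = c * x1 - s * x2
    out[1::2] = s * x1 + c * x2
    return out

def gated_rotary_linear_attention(Q, K, V, gate_w, angle_w):
    """Hypothetical sketch of conformal-sympow's two mechanisms on top of a
    plain linear-attention recurrence (feature map omitted for clarity).

    Q, K: (T, d_k) with d_k even; V: (T, d_v).
    gate_w: (d_k,) projection for the scalar forget gate (assumed form).
    angle_w: (d_k, d_k // 2) projection for per-token angle increments (assumed form).
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))          # recurrent state with finite capacity
    cum_angles = np.zeros(d_k // 2)   # cumulative data-dependent rotary angles
    out = np.empty((T, d_v))
    for t in range(T):
        # (1) data-dependent multiplicative gate in (0, 1) frees up state capacity
        g = 1.0 / (1.0 + np.exp(-(K[t] @ gate_w)))
        # (2) data-dependent rotary angles accumulate along the sequence
        cum_angles = cum_angles + K[t] @ angle_w
        k_rot = rope(K[t], cum_angles)
        q_rot = rope(Q[t], cum_angles)
        S = g * S + np.outer(k_rot, V[t])   # gated state update (write)
        out[t] = q_rot @ S                  # linear-attention readout (read)
    return out
```

With `angle_w = 0` and a gate saturated at 1, this reduces to the standard unnormalized linear-attention recurrence, which makes explicit that both mechanisms are modifications of how the fixed-size state is written to and read from, not of its size.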