🤖 AI Summary
To address information forgetting in symmetric power (sympow) linear transformers, caused by the limited capacity of their recurrent state in long-sequence modeling, this paper proposes the conformal-sympow architecture. Methodologically, it introduces (1) a data-dependent multiplicative gating mechanism that dynamically frees up state capacity, and (2) data-dependent rotary embeddings that enable selective, position-aware information storage. The design preserves the key efficiency features of sympow transformers, including linear-cost attention and symmetric tensor embeddings. In preliminary experiments on the LongCrawl64 dataset, conformal-sympow maintains robust performance as training and evaluation context lengths are scaled, overcoming the degradation observed in sympow transformers in these settings.
📝 Abstract
Transformers with linear attention offer significant computational advantages over softmax-based transformers but often suffer from degraded performance. The symmetric power (sympow) transformer, a particular type of linear transformer, addresses some of this performance gap by leveraging symmetric tensor embeddings, achieving comparable performance to softmax transformers. However, the finite capacity of the recurrent state in sympow transformers limits their ability to retain information, leading to performance degradation when scaling the training or evaluation context length. To address this issue, we propose the conformal-sympow transformer, which dynamically frees up capacity using data-dependent multiplicative gating and adaptively stores information using data-dependent rotary embeddings. Preliminary experiments on the LongCrawl64 dataset demonstrate that conformal-sympow overcomes the limitations of sympow transformers, achieving robust performance across scaled training and evaluation contexts.
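The two proposed mechanisms can be illustrated with a minimal single-head sketch: a causal linear-attention recurrence whose state is decayed by a data-dependent scalar gate and whose keys and queries pass through rotary embeddings with data-dependent, cumulative per-token angles. This is an illustrative reconstruction, not the paper's implementation; the shapes, the sigmoid gate, and the projection weights `gate_w` and `angle_w` are assumptions, and the symmetric-power embedding itself is omitted for brevity.

```python
import numpy as np

def rope(x, angles):
    """Rotary embedding: rotate consecutive feature pairs of x by the given angles."""
    x1, x2 = x[0::2], x[1::2]
    c, s = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = c * x1 - s * x2
    out[1::2] = s * x1 + c * x2
    return out

def gated_rotary_linear_attention(Q, K, V, gate_w, angle_w):
    """Hypothetical sketch of conformal-sympow's two mechanisms on top of a
    plain linear-attention recurrence (feature map omitted for clarity).

    Q, K: (T, d_k) with d_k even; V: (T, d_v).
    gate_w: (d_k,) projection for the scalar forget gate (assumed form).
    angle_w: (d_k, d_k // 2) projection for per-token angle increments (assumed form).
    """
    T, d_k = Q.shape
    d_v = V.shape[1]
    S = np.zeros((d_k, d_v))          # recurrent state with finite capacity
    cum_angles = np.zeros(d_k // 2)   # cumulative data-dependent rotary angles
    out = np.empty((T, d_v))
    for t in range(T):
        # (1) data-dependent multiplicative gate in (0, 1) frees up state capacity
        g = 1.0 / (1.0 + np.exp(-(K[t] @ gate_w)))
        # (2) data-dependent rotary angles accumulate along the sequence
        cum_angles = cum_angles + K[t] @ angle_w
        k_rot = rope(K[t], cum_angles)
        q_rot = rope(Q[t], cum_angles)
        S = g * S + np.outer(k_rot, V[t])   # gated state update (write)
        out[t] = q_rot @ S                  # linear-attention readout (read)
    return out
```

With `angle_w = 0` and a gate saturated at 1, this reduces to the standard unnormalized linear-attention recurrence, which makes explicit that both mechanisms are modifications of how the fixed-size state is written to and read from, not of its size.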