🤖 AI Summary
Existing hyperbolic Transformer modules are incomplete: they lack hyperbolic linear transformations, LayerNorm, activation functions, and Dropout, and their hyperbolic self-attention incurs quadratic O(n²) computational complexity. To address these limitations, this work proposes the first fully hyperbolic Transformer architecture, grounded in the Lorentz model. It systematically introduces hyperbolic linear mappings, hyperbolic LayerNorm, hyperbolic Softmax, hyperbolic Dropout, and a linear-complexity O(n) hyperbolic self-attention mechanism. The architecture migrates all core Transformer components into hyperbolic space, significantly enhancing modeling capability for hierarchical data, including long sequences and billion-scale graphs. Experiments demonstrate consistent superiority over Euclidean Transformers and state-of-the-art hyperbolic models across diverse hierarchical data tasks. The proposed model achieves a 3.2× training-time speedup and supports sequences of up to 16K tokens and graphs with up to one billion nodes.
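To make the Lorentz-model grounding concrete, here is a minimal sketch of the standard construction: a point x in the Lorentz model of curvature −1/k satisfies ⟨x, x⟩_L = −k under the Lorentzian inner product, so a Euclidean feature vector can be lifted onto the hyperboloid by solving for the time coordinate. The helper names `lorentz_inner` and `lift_to_lorentz` are hypothetical, not from the paper's code; this is textbook Lorentz-model arithmetic, not Hypformer's specific layer definitions.

```python
import numpy as np

def lorentz_inner(x, y):
    # Lorentzian inner product: -x_0 * y_0 + <x_space, y_space>
    return -x[..., 0] * y[..., 0] + np.sum(x[..., 1:] * y[..., 1:], axis=-1)

def lift_to_lorentz(v, k=1.0):
    """Lift a Euclidean vector v onto the hyperboloid of curvature -1/k
    by solving <x, x>_L = -k for the time component x_0 (hypothetical
    helper; textbook construction, not the paper's code)."""
    time = np.sqrt(k + np.sum(v * v, axis=-1, keepdims=True))
    return np.concatenate([time, v], axis=-1)

x = lift_to_lorentz(np.array([0.3, -1.2, 0.5]))
print(np.isclose(lorentz_inner(x, x), -1.0))  # point lies on the manifold
```

Working on the hyperboloid directly (rather than the Poincaré ball) is commonly preferred for numerical stability, which is consistent with the paper's choice of the Lorentz model.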
📝 Abstract
Hyperbolic geometry has shown significant potential in modeling complex structured data, particularly data with underlying tree-like and hierarchical structures. Despite the impressive performance of various hyperbolic neural networks across numerous domains, research on adapting the Transformer to hyperbolic space remains limited. Previous attempts have mainly focused on modifying the self-attention modules of the Transformer. However, these efforts have fallen short of developing a complete hyperbolic Transformer. This stems primarily from: (i) the absence of well-defined modules in hyperbolic space, including linear transformation layers, LayerNorm layers, activation functions, dropout operations, etc.; and (ii) the quadratic time complexity of the existing hyperbolic self-attention module w.r.t. the number of input tokens, which hinders its scalability. To address these challenges, we propose Hypformer, a novel hyperbolic Transformer based on the Lorentz model of hyperbolic geometry. In Hypformer, we introduce two foundational blocks that define the essential modules of the Transformer in hyperbolic space. Furthermore, we develop a linear self-attention mechanism in hyperbolic space, enabling a hyperbolic Transformer to process billion-scale graph data and long-sequence inputs for the first time. Our experimental results confirm the effectiveness and efficiency of our method across various datasets, demonstrating its potential as an effective and scalable solution for large-scale data representation and large models.
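The O(n) self-attention claim rests on the same algebraic trick used by Euclidean linear-attention methods: replace the softmax with a positive feature map φ so that attention factors as φ(Q)(φ(K)ᵀV), letting the (d × d) key-value summary be computed once and shared across all queries. The sketch below illustrates that reassociation in Euclidean space with the common elu(x)+1 feature map; it is an assumption-laden illustration of the complexity argument, not Hypformer's actual hyperbolic formulation.

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """O(n) kernelized attention: phi(Q) @ (phi(K)^T V), with elu(x)+1
    as the feature map. Illustrative Euclidean sketch only; Hypformer's
    hyperbolic mechanism is defined on the Lorentz model, not shown here."""
    def phi(x):
        return np.where(x > 0, x + 1.0, np.exp(x))  # elu(x) + 1, strictly positive
    Qf, Kf = phi(Q), phi(K)
    kv = Kf.T @ V                    # (d, d_v) summary, built once in O(n d d_v)
    z = Qf @ Kf.sum(axis=0)          # per-query normalizer, O(n d)
    return (Qf @ kv) / (z[:, None] + eps)

n, d = 8, 4
rng = np.random.default_rng(0)
out = linear_attention(rng.normal(size=(n, d)),
                       rng.normal(size=(n, d)),
                       rng.normal(size=(n, d)))
print(out.shape)  # (8, 4)
```

Because the n × n attention matrix is never materialized, memory and time scale linearly in sequence length, which is what makes 16K-token sequences and billion-node graphs tractable.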