HELM: Hyperbolic Large Language Models via Mixture-of-Curvature Experts

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large language models (LLMs) rely on Euclidean geometry, which struggles to capture the intrinsic semantic hierarchies and nonlinear geometric structure of language, leading to training instabilities and constrained generative capacity. To address this, we propose HELM, the first family of fully hyperbolic LLMs trained at billion-parameter scale, built on an end-to-end hyperbolic Transformer architecture. The method introduces three key innovations: (1) a Mixture-of-Curvature Experts model (HELM-MICE), in which each expert operates in a space of distinct curvature to capture finer-grained geometric structure in text; (2) hyperbolic Multi-Head Latent Attention (HMLA) for efficient, reduced-KV-cache training and inference; and (3) hyperbolic equivalents of rotary positional encodings and RMS normalization, used in both HELM-MICE and the dense variant HELM-D. Evaluated on MMLU, ARC, and other benchmarks spanning STEM problem-solving, general knowledge, and commonsense reasoning, the HELM models achieve gains of up to 4% over the Euclidean architectures used in LLaMA and DeepSeek. This work establishes a hyperbolic geometric foundation for large-scale language modeling.
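Operating "fully in hyperbolic space" rests on a small set of standard Poincaré-ball primitives: Möbius addition, the exponential map at the origin, and geodesic distance. A minimal NumPy sketch of these textbook operations (using the common convention of curvature -c with c > 0; this is background math, not the paper's code):

```python
import numpy as np

def mobius_add(x, y, c=1.0):
    """Mobius addition on the Poincare ball of curvature -c."""
    xy = np.dot(x, y)
    x2 = np.dot(x, x)
    y2 = np.dot(y, y)
    num = (1 + 2 * c * xy + c * y2) * x + (1 - c * x2) * y
    den = 1 + 2 * c * xy + c**2 * x2 * y2
    return num / den

def expmap0(v, c=1.0):
    """Exponential map at the origin: lifts a Euclidean tangent vector onto the ball."""
    sc, n = np.sqrt(c), np.linalg.norm(v)
    return v if n == 0 else np.tanh(sc * n) * v / (sc * n)

def dist(x, y, c=1.0):
    """Geodesic distance between two points on the Poincare ball."""
    sc = np.sqrt(c)
    diff = mobius_add(-x, y, c)
    return (2.0 / sc) * np.arctanh(sc * np.linalg.norm(diff))
```

A useful sanity check: the distance from the origin to `expmap0(v)` is exactly twice the Euclidean norm of `v` under this convention, which is why distances near the boundary of the ball grow without bound.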

📝 Abstract
Large language models (LLMs) have shown great success in text modeling tasks across domains. However, natural language exhibits inherent semantic hierarchies and nuanced geometric structure, which current LLMs do not capture completely owing to their reliance on Euclidean operations. Recent studies have also shown that not respecting the geometry of token embeddings leads to training instabilities and degradation of generative capabilities. These findings suggest that shifting to non-Euclidean geometries can better align language models with the underlying geometry of text. We thus propose to operate fully in hyperbolic space, known for its expansive, scale-free, and low-distortion properties, and introduce HELM, a family of HypErbolic Large Language Models, offering a geometric rethinking of the Transformer-based LLM that addresses the representational inflexibility, missing set of necessary operations, and poor scalability of existing hyperbolic LMs. We additionally introduce a Mixture-of-Curvature Experts model, HELM-MICE, where each expert operates in a distinct curvature space to encode more fine-grained geometric structure from text, as well as a dense model, HELM-D. For HELM-MICE, we further develop hyperbolic Multi-Head Latent Attention (HMLA) for efficient, reduced-KV-cache training and inference. For both models, we develop essential hyperbolic equivalents of rotary positional encodings and RMS normalization. We are the first to train fully hyperbolic LLMs at billion-parameter scale, and evaluate them on well-known benchmarks such as MMLU and ARC, spanning STEM problem-solving, general knowledge, and commonsense reasoning. Our results show consistent gains from our HELM architectures -- up to 4% -- over popular Euclidean architectures used in LLaMA and DeepSeek, highlighting the efficacy and enhanced reasoning afforded by hyperbolic geometry in large-scale LM pretraining.
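The abstract lists hyperbolic RMS normalization among the components the authors develop. The paper's exact formulation is not given here, but a common way to build such operators is to apply the Euclidean operation in the tangent space at the origin (log map, normalize, exp map). A sketch under that assumption, not the paper's definition:

```python
import numpy as np

def expmap0(v, c=1.0):
    """Tangent vector at the origin -> point on the Poincare ball of curvature -c."""
    sc, n = np.sqrt(c), np.linalg.norm(v)
    return v if n == 0 else np.tanh(sc * n) * v / (sc * n)

def logmap0(x, c=1.0):
    """Inverse of expmap0: ball point -> tangent vector at the origin."""
    sc, n = np.sqrt(c), np.linalg.norm(x)
    return x if n == 0 else np.arctanh(sc * n) * x / (sc * n)

def hyperbolic_rmsnorm(x, gain, c=1.0, eps=1e-8):
    """Hypothetical hyperbolic RMSNorm via the tangent space at the origin:
    log-map to Euclidean coordinates, RMS-normalize with a learned gain,
    and exp-map back onto the ball. The paper's construction may differ."""
    v = logmap0(x, c)
    rms = np.sqrt(np.mean(v * v) + eps)
    return expmap0(gain * v / rms, c)
```

One property of this construction: the output always lands back inside the ball of radius 1/sqrt(c), since the exp map is applied last, so the normalization cannot push activations out of the manifold.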
Problem

Research questions and friction points this paper is trying to address.

Natural language exhibits semantic hierarchies and nuanced geometric structure that Euclidean LLMs do not fully capture
Ignoring the geometry of token embeddings leads to training instabilities and degraded generative capability
Existing hyperbolic LMs suffer from representational inflexibility, missing operations, and poor scalability
Innovation

Methods, ideas, or system contributions that make the work stand out.

Fully hyperbolic billion-parameter LLMs (dense HELM-D and HELM-MICE) aligned with the geometry of text
Mixture-of-Curvature Experts, each operating in a distinct curvature space, for fine-grained geometric structure
Hyperbolic equivalents of key Transformer components: Multi-Head Latent Attention (HMLA), rotary positional encodings, and RMS normalization
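The Mixture-of-Curvature Experts idea (each expert operating in a space of distinct curvature) can be illustrated with a toy forward pass for a single token. The router, weight shapes, and the ball-side ReLU below are illustrative assumptions for the sketch, not the paper's architecture:

```python
import numpy as np

def expmap0(v, c):
    """Tangent vector at the origin -> point on the Poincare ball of curvature -c."""
    sc, n = np.sqrt(c), np.linalg.norm(v)
    return v if n == 0 else np.tanh(sc * n) * v / (sc * n)

def logmap0(x, c):
    """Inverse of expmap0: ball point -> tangent vector at the origin."""
    sc, n = np.sqrt(c), np.linalg.norm(x)
    return x if n == 0 else np.arctanh(sc * n) * x / (sc * n)

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def mice_layer(v, router_W, expert_Ws, curvatures):
    """Toy Mixture-of-Curvature-Experts pass for one token (hypothetical layout).

    v          : tangent-space hidden state (Euclidean vector)
    router_W   : (n_experts, d) gating matrix
    expert_Ws  : per-expert (d, d) weight matrices
    curvatures : one curvature c_i > 0 per expert

    Each expert applies a hyperbolic linear step at its own curvature:
    exp-map into its ball, a ball-preserving pointwise ReLU, log-map back.
    """
    gates = softmax(router_W @ v)
    out = np.zeros_like(v)
    for g, W, c in zip(gates, expert_Ws, curvatures):
        y = expmap0(W @ v, c)        # into expert i's curvature-c ball
        y = np.maximum(y, 0.0)       # ReLU only shrinks the norm, so it stays in the ball
        out += g * logmap0(y, c)     # back to the shared tangent space, gated
    return out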