Fast and Simplex: 2-Simplicial Attention in Triton

📅 2025-07-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Scaling laws for large language models (LLMs) assume an effectively unlimited supply of data, but as models increasingly depend on massive internet-scale corpora, training is becoming limited by tokens rather than by compute. To address this, the paper studies the 2-simplicial Transformer, an architecture that generalizes dot-product attention to trilinear functions, and implements it with an efficient Triton kernel. For a fixed token budget, similarly sized 2-simplicial models outperform standard dot-product Transformers on tasks involving mathematics, coding, reasoning, and logic. The gains are quantified by showing that 2-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks. The core contributions are: (i) evidence that a trilinear attention mechanism improves token efficiency and the scaling exponent, and (ii) an efficient Triton implementation that makes the approach practical.

📝 Abstract

Recent work has shown that training loss scales as a power law with both model size and the number of tokens, and that achieving compute-optimal models requires scaling model size and token count together. However, these scaling laws assume an infinite supply of data and apply primarily in compute-bound settings. As modern large language models increasingly rely on massive internet-scale datasets, the assumption that they are compute-bound is becoming less valid. This shift highlights the need for architectures that prioritize token efficiency. In this work, we investigate the use of the 2-simplicial Transformer, an architecture that generalizes standard dot-product attention to trilinear functions through an efficient Triton kernel implementation. We demonstrate that the 2-simplicial Transformer achieves better token efficiency than standard Transformers: for a fixed token budget, similarly sized models outperform their dot-product counterparts on tasks involving mathematics, coding, reasoning, and logic. We quantify these gains by demonstrating that 2-simplicial attention changes the exponent in the scaling laws for knowledge and reasoning tasks compared to dot-product attention.
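To make "generalizes standard dot-product attention to trilinear functions" concrete, here is a minimal NumPy sketch of one plausible reading. It is not the paper's Triton kernel. The names `k1`, `k2`, `v1`, `v2`, the `1/sqrt(d)` scaling, the joint softmax over key pairs, and the elementwise product of the two value vectors are all assumptions made for illustration. The sketch is single-head, unmasked, and deliberately naive (cubic in sequence length).

```python
import numpy as np

def dot_product_attention(q, k, v):
    """Standard single-head attention: bilinear logits q_i . k_j."""
    d = q.shape[-1]
    logits = q @ k.T / np.sqrt(d)                      # shape (n, m)
    logits = logits - logits.max(axis=-1, keepdims=True)
    w = np.exp(logits)
    w /= w.sum(axis=-1, keepdims=True)                 # softmax over j
    return w @ v

def two_simplicial_attention(q, k1, k2, v1, v2):
    """Trilinear attention sketch (assumed form, not the paper's kernel).

    logits[i, j, k] = sum_d q[i, d] * k1[j, d] * k2[k, d] / sqrt(d)
    The softmax is taken jointly over all (j, k) pairs for each query i.
    """
    n, d = q.shape
    logits = np.einsum("id,jd,kd->ijk", q, k1, k2) / np.sqrt(d)
    flat = logits.reshape(n, -1)                       # merge (j, k) axes
    flat = flat - flat.max(axis=-1, keepdims=True)     # numerical stability
    w = np.exp(flat)
    w /= w.sum(axis=-1, keepdims=True)
    w = w.reshape(logits.shape)
    # Each (j, k) pair contributes the elementwise product v1[j] * v2[k].
    return np.einsum("ijk,jd,kd->id", w, v1, v2)
```

A quick sanity check on this sketch: if `k2` and `v2` are all ones, the third factor drops out of both the logits and the values, and the result coincides with ordinary dot-product attention on `q`, `k1`, `v1`. That is the sense in which the trilinear form contains the bilinear one as a special case.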

Problem

Research questions and friction points this paper is trying to address.

Improving token efficiency in large language models

Generalizing dot-product attention to trilinear functions

Enhancing performance in math, coding, and reasoning tasks

Innovation

Methods, ideas, or system contributions that make the work stand out.

2-simplicial Transformer for token efficiency

Trilinear attention via Triton kernel

Improved scaling laws for reasoning tasks
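
The scaling-law claim can be read against a commonly used parametric form of the loss. The expression below is an illustrative assumption, not taken from this page: $N$ is the parameter count, $D$ the number of training tokens, and $E$, $A$, $B$, $\alpha$, $\beta$ are fitted constants whose values are not reproduced here.

```latex
L(N, D) = E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}
```

Under this form, at a fixed token budget $D$ the reducible loss falls with model size as $N^{-\alpha}$. On this reading, "changes the exponent" means the fitted $\alpha$ for 2-simplicial attention differs from that of dot-product attention, which is a stronger statement than a shift in the coefficient $A$ alone.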
