Kan Extension Transformers: A Categorical Unification of Attention, Diffusion, and Predict-Detach Self-Conditioning

📅 2026-05-26

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work proposes a unified framework, grounded in categorical Kan extensions, that models a Transformer layer as a weighted structured extension operator. Standard attention, Geometric Transformer style incidence mixing, and higher-order simplicial variants appear as special cases of the same construction. The authors also introduce a predict-detach self-conditioning mechanism: the extension operator acts on detached predictive carriers rather than teacher-forced hidden states, which exposes noncausal structure without leaking gold future tokens. Experiments on 12 Transformer implementations across Penn Treebank, WikiText-2, and WikiText-103 show that the quadratic Kan Extension Transformer (KET) is the strongest of the compared causal architectures on WikiText-2 and WikiText-103 in the strict-causal setting, while the largest gains on all datasets come from the predict-detach regime rather than from changing the neighborhood family alone.

📝 Abstract

We propose Kan Extension Transformers (KETs) as a unifying categorical framework for a diverse group of Transformer implementations. The core claim is that a Transformer layer can be viewed as a weighted structured extension operator: standard attention is the singleton-neighborhood case, Geometric Transformer style incidence mixing is a sparse edge-restricted case, and KET is the higher-order simplicial case. This lens also clarifies a bridge to diffusion-style completion. When the extension operator acts on detached predictive carriers instead of teacher-forced hidden states, it becomes a valid self-conditioning mechanism that exposes noncausal structure without leaking gold future tokens. We include a comprehensive experimental validation of 12 different Transformer implementations varying across strict-causal and predict-detach regimes on Penn Treebank, WikiText-2, and WikiText-103. In the strict-causal setting, quadratic KET is the strongest model among the compared causal architectures on WikiText-2 and WikiText-103. Across all datasets, however, the largest gains come from the predict-detach regime rather than from changing the neighborhood family alone.
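
The "weighted extension operator" reading of a layer can be illustrated with a small NumPy sketch. It is not taken from the paper: the function names, the elementwise-product pair features, and the softmax weighting are all assumptions chosen for illustration. It contrasts a singleton-neighborhood aggregation (ordinary causal attention) with a hypothetical quadratic case in which each position aggregates over pairs of earlier positions.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries hold -inf and get weight 0.
    x = x - np.max(x, axis=axis, keepdims=True)
    e = np.exp(x)
    return e / np.sum(e, axis=axis, keepdims=True)

def singleton_extension(q, k, v):
    """Causal attention: position i aggregates over single positions j <= i."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    causal = np.tril(np.ones((n, n), dtype=bool))
    scores = np.where(causal, scores, -np.inf)
    return softmax(scores, axis=-1) @ v

def pair_extension(q, k, v):
    """Hypothetical quadratic case: position i aggregates over pairs (j, l)
    with j <= i and l <= i. Pair keys and values are elementwise products,
    which is an assumption made for this sketch, not the paper's definition."""
    n, d = q.shape
    pair_k = k[:, None, :] * k[None, :, :]          # shape (n, n, d)
    pair_v = v[:, None, :] * v[None, :, :]          # shape (n, n, d)
    scores = np.einsum("id,jld->ijl", q, pair_k) / np.sqrt(d)
    idx = np.arange(n)
    i = idx[:, None, None]
    allowed = (idx[None, :, None] <= i) & (idx[None, None, :] <= i)
    scores = np.where(allowed, scores, -np.inf)
    # Normalize over all allowed pairs for each query position.
    w = softmax(scores.reshape(n, n * n), axis=-1).reshape(n, n, n)
    return np.einsum("ijl,jld->id", w, pair_v)

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 4))                         # 5 tokens, width 4
out_single = singleton_extension(x, x, x)
out_pair = pair_extension(x, x, x)
```

Both operators share the same shape: a masked, normalized weighting over a neighborhood family, followed by a weighted sum of neighborhood values. Only the family changes (single positions versus pairs), which is the sense in which the abstract treats standard attention and higher-order variants as cases of one construction. The pair version costs O(n³) memory here, so it is only meant for small inputs.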

Problem

Research questions and friction points this paper is trying to address.

Transformer

Kan Extension

Attention Mechanism

Diffusion

Self-Conditioning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Kan Extension Transformers

categorical framework

predict-detach self-conditioning