Hierarchical Kernel Transformer: Multi-Scale Attention with an Information-Theoretic Approximation Analysis

📅 2026-04-10

🤖 AI Summary
This work addresses the limited ability of conventional attention to model multiscale sequential dependencies without a large increase in computational cost. The authors propose the Hierarchical Kernel Transformer (HKT), which processes a sequence at L resolution levels obtained by trainable causal downsampling and fuses the level-specific attention score matrices with learned convex weights. The total cost is bounded by 4/3 times that of standard attention (about 1.31× for L = 3). On the theory side, the paper gives a sufficient condition under which the hierarchical score matrix defines a positive semidefinite kernel, introduces a symmetric-antisymmetric decomposition of the score matrix, derives an information-theoretic approximation error analysis, and shows that HKT strictly subsumes single-head standard attention and causal convolution. Over 3 random seeds, HKT improves on retrained standard attention baselines by 4.77, 1.44, and 7.47 percentage points on synthetic ListOps, sequential CIFAR-10, and character-level IMDB, respectively.
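
The symmetric-antisymmetric decomposition mentioned in the summary is the standard split of any square matrix S into (S + Sᵀ)/2 and (S − Sᵀ)/2. The sketch below only illustrates that identity on a toy 2×2 score matrix; the function name `decompose` and the example values are hypothetical and are not taken from the paper, and nothing here reproduces how HKT computes its scores.

```python
def decompose(scores):
    """Split a square score matrix into symmetric and antisymmetric parts.

    sym[i][j]  = (S[i][j] + S[j][i]) / 2   (same in both directions)
    anti[i][j] = (S[i][j] - S[j][i]) / 2   (flips sign when i and j swap)
    """
    n = len(scores)
    sym = [[(scores[i][j] + scores[j][i]) / 2 for j in range(n)] for i in range(n)]
    anti = [[(scores[i][j] - scores[j][i]) / 2 for j in range(n)] for i in range(n)]
    return sym, anti


# Toy asymmetric score matrix (illustrative values only).
scores = [[1.0, 4.0],
          [2.0, 3.0]]
sym, anti = decompose(scores)

# The two parts always add back up to the original matrix.
rebuilt = [[sym[i][j] + anti[i][j] for j in range(2)] for i in range(2)]
```

Here `sym` carries the part of the score that positions i and j share in both directions, and `anti` carries the part that depends on direction, which matches the roles the abstract assigns to reciprocal and directional attention.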

📝 Abstract
The Hierarchical Kernel Transformer (HKT) is a multi-scale attention mechanism that processes sequences at L resolution levels via trainable causal downsampling, combining level-specific score matrices through learned convex weights. The total computational cost is bounded by 4/3 times that of standard attention, reaching 1.3125× for L = 3. Four theoretical results are established. (i) The hierarchical score matrix defines a positive semidefinite kernel under a sufficient condition on the symmetrised bilinear form (Proposition 3.1). (ii) The asymmetric score matrix decomposes uniquely into a symmetric part controlling reciprocal attention and an antisymmetric part controlling directional attention; HKT provides L independent such pairs across scales, one per resolution level (Propositions 3.5-3.6). (iii) The approximation error decomposes into three interpretable components with an explicit non-Gaussian correction and a geometric decay bound in L (Theorem 4.3, Proposition 4.4). (iv) HKT strictly subsumes single-head standard attention and causal convolution (Proposition 3.4). Experiments over 3 random seeds show consistent gains over retrained standard attention baselines: +4.77pp on synthetic ListOps (55.10 ± 0.29% vs 50.33 ± 0.12%, T = 512), +1.44pp on sequential CIFAR-10 (35.45 ± 0.09% vs 34.01 ± 0.19%, T = 1,024), and +7.47pp on IMDB character-level sentiment (70.19 ± 0.57% vs 62.72 ± 0.40%, T = 1,024), all at 1.31× overhead.
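
The two cost figures in the abstract (the 4/3 bound and 1.3125× at L = 3) are consistent with a geometric series, which the sketch below works through. It assumes, as a reading that the abstract does not state explicitly, that each level halves the sequence length, so the score matrix at level l costs 4^(-l) times the full-resolution one. The function name and the `downsample_factor` parameter are hypothetical.

```python
def relative_attention_cost(num_levels: int, downsample_factor: int = 2) -> float:
    """Cost of all levels relative to one full-resolution score matrix.

    Assumption (not stated in the abstract): level l works on a sequence of
    length T / downsample_factor**l, so its quadratic score matrix costs
    downsample_factor**(-2 * l) times the full-resolution one.
    """
    return sum(float(downsample_factor) ** (-2 * level) for level in range(num_levels))


# With halving, the terms are 1, 1/4, 1/16, ... and the sum approaches 4/3.
for num_levels in (1, 2, 3, 4):
    print(num_levels, relative_attention_cost(num_levels))
```

Under this assumption, three levels give 1 + 1/4 + 1/16 = 1.3125, and adding further levels can never push the total past 4/3.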
Problem

Research questions and friction points this paper is trying to address.

multi-scale attention
computational efficiency
sequence modeling
hierarchical representation
attention mechanism
Innovation

Methods, ideas, or system contributions that make the work stand out.

Hierarchical Kernel Transformer
multi-scale attention
information-theoretic approximation
causal downsampling
asymmetric attention decomposition
🔎 Similar Papers