🤖 AI Summary
This work addresses the high computational cost of conventional attention and its limited ability to model multiscale sequential dependencies. The authors propose the Hierarchical Kernel Transformer (HKT), which builds multiscale attention over L resolution levels through trainable causal downsampling and fuses the per-level attention score matrices using learned convex weights, at about 1.31× the computational cost of standard attention. Theoretically, the study shows that the hierarchical score matrix defines a positive semidefinite kernel under a sufficient condition, introduces a symmetric–antisymmetric decomposition of the score matrix, decomposes the approximation error into three interpretable components, and proves that HKT subsumes single-head standard attention and causal convolution as special cases. Experiments show accuracy gains of 4.77, 1.44, and 7.47 percentage points over retrained standard attention baselines on ListOps, sCIFAR-10, and IMDB, respectively.
📝 Abstract
The Hierarchical Kernel Transformer (HKT) is a multi-scale attention mechanism that processes sequences at L resolution levels via trainable causal downsampling, combining level-specific score matrices through learned convex weights. The total computational cost is bounded by 4/3 times that of standard attention, reaching 1.3125× for L = 3. Four theoretical results are established. (i) The hierarchical score matrix defines a positive semidefinite kernel under a sufficient condition on the symmetrised bilinear form (Proposition 3.1). (ii) The asymmetric score matrix decomposes uniquely into a symmetric part controlling reciprocal attention and an antisymmetric part controlling directional attention; HKT provides L independent such pairs across scales, one per resolution level (Propositions 3.5-3.6). (iii) The approximation error decomposes into three interpretable components with an explicit non-Gaussian correction and a geometric decay bound in L (Theorem 4.3, Proposition 4.4). (iv) HKT strictly subsumes single-head standard attention and causal convolution (Proposition 3.4). Experiments over 3 random seeds show consistent gains over retrained standard attention baselines: +4.77pp on synthetic ListOps (55.10 ± 0.29% vs 50.33 ± 0.12%, T = 512), +1.44pp on sequential CIFAR-10 (35.45 ± 0.09% vs 34.01 ± 0.19%, T = 1,024), and +7.47pp on IMDB character-level sentiment (70.19 ± 0.57% vs 62.72 ± 0.40%, T = 1,024), all at 1.31× overhead.
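The cost figures and the decomposition in result (ii) can be checked with a short sketch. It assumes each level halves the sequence length, so that a level's score matrix costs a quarter of the previous one; the abstract does not state the downsampling factor, but this assumption reproduces both the 4/3 bound and the 1.3125× value for L = 3. The function names `relative_cost` and `split_scores` are hypothetical and chosen for illustration only, and plain Python lists are used to keep the example dependency-free.

```python
def relative_cost(num_levels, factor=2):
    """Score-matrix cost of all levels relative to full-resolution attention.

    Assumption (not stated in the abstract): level l has length T / factor**l,
    so its score matrix is factor**(2*l) times smaller than the level-0 one.
    """
    return sum(factor ** (-2 * level) for level in range(num_levels))


def split_scores(scores):
    """Split a square score matrix S into (S + S^T)/2 and (S - S^T)/2."""
    n = len(scores)
    sym = [[0.5 * (scores[i][j] + scores[j][i]) for j in range(n)] for i in range(n)]
    anti = [[0.5 * (scores[i][j] - scores[j][i]) for j in range(n)] for i in range(n)]
    return sym, anti


if __name__ == "__main__":
    # Geometric series 1 + 1/4 + 1/16 + ... stays below 4/3.
    for levels in (1, 2, 3, 10):
        print(levels, relative_cost(levels))

    # Toy asymmetric score matrix; values are illustrative only.
    sym, anti = split_scores([[0.0, 2.0], [4.0, 1.0]])
    print(sym)   # symmetric part: equal weight in both directions
    print(anti)  # antisymmetric part: sign encodes direction
```

Under the halving assumption the relative cost is the geometric series 1 + 1/4 + 1/16 + ..., whose limit is 4/3. The symmetric and antisymmetric parts sum back to the original matrix, which is why the decomposition is unique.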