INCRT: An Incremental Transformer That Determines Its Own Architecture

📅 2026-04-12
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work proposes INCRT, an incremental Transformer that determines its own architecture during training. Conventional Transformers fix the number of attention heads, the depth, and the head size in advance, which leaves many heads redundant. Starting from a single attention head, INCRT adds a head whenever the current configuration is insufficient and prunes heads that have become redundant, arriving at a minimal yet sufficient attention architecture without a preset design or a separate validation phase. Growth decisions are driven by an online-computable geometric quantity derived from the task's directional structure. The method comes with two theoretical guarantees: convergence to a finite stopping configuration, and an upper bound on the number of attention heads as a function of the task's spectral complexity. On SARS-CoV-2 variant classification and SST-2 sentiment analysis, the final architectures use 3–7× fewer parameters than BERT-base, with no pre-training, while matching or exceeding its performance on these distribution-specific tasks; predicted and observed head counts agree within 12%.

📝 Abstract
Transformer architectures are designed by trial and error: the number of attention heads, the depth, and the head size are fixed before training begins, with no mathematical principle to guide the choice. The result is systematic structural redundancy (between half and four-fifths of all heads in a trained model can be removed without measurable loss) because the architecture allocates capacity without reference to the actual requirements of the task.

This paper introduces INCRT (Incremental Transformer), an architecture that determines its own structure during training. Starting from a single head, INCRT adds one attention head at a time whenever its current configuration is provably insufficient, and prunes heads that have become redundant. Each growth decision is driven by a single, online-computable geometric quantity derived from the task's directional structure, requiring no separate validation phase and no hand-tuned schedule. Two theorems form the theoretical backbone. The first (homeostatic convergence) establishes that the system always reaches a finite stopping configuration that is simultaneously minimal (no redundant heads) and sufficient (no uncaptured directional energy above the threshold). The second (compressed-sensing analogy) provides a geometric upper bound on the number of heads that this configuration can contain, as a function of the spectral complexity of the task. Experiments on SARS-CoV-2 variant classification and SST-2 sentiment analysis confirm both results: the predicted and observed head counts agree within 12% across all benchmarks, and the final architectures match or exceed BERT-base on distribution-specific tasks while using between three and seven times fewer parameters and no pre-training.
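
The grow-and-prune loop described in the abstract can be illustrated with a toy sketch. This is not the paper's algorithm: the abstract does not define the geometric quantity, so the sketch assumes, purely for illustration, that the task's directional structure is given as a symmetric positive semi-definite matrix and that the "uncaptured directional energy" is the largest eigenvalue left after removing the directions already assigned to heads. All names (`grow_heads`, `prune_heads`, `top_direction`, `threshold`) are hypothetical.

```python
import math

def matvec(m, v):
    """Multiply a square matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def top_direction(m, iters=200):
    """Power iteration: approximate leading eigenpair of a symmetric PSD matrix."""
    n = len(m)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iters):
        w = matvec(m, v)
        norm = math.sqrt(sum(x * x for x in w))
        if norm < 1e-12:  # no energy left in the residual
            return 0.0, v
        v = [x / norm for x in w]
    energy = sum(a * b for a, b in zip(v, matvec(m, v)))
    return energy, v

def deflate(m, energy, v):
    """Remove the direction v (carrying `energy`) from the matrix."""
    n = len(m)
    return [[m[i][j] - energy * v[i] * v[j] for j in range(n)] for i in range(n)]

def grow_heads(task_matrix, threshold, max_heads=64):
    """Assumed growth rule: start with one head, then add one head at a time
    while the residual (uncaptured) energy stays above the threshold."""
    residual = [row[:] for row in task_matrix]
    heads = []
    while len(heads) < max_heads:
        energy, direction = top_direction(residual)
        if heads and energy <= threshold:  # sufficient: stop growing
            break
        heads.append((energy, direction))
        residual = deflate(residual, energy, direction)
    return heads

def prune_heads(heads, threshold):
    """Assumed pruning rule: drop heads whose captured energy is at or below
    the threshold, always keeping at least one head."""
    kept = [h for h in heads if h[0] > threshold]
    return kept or heads[:1]

if __name__ == "__main__":
    # Hypothetical task with two strong directions and two weak ones.
    task = [[5.0, 0, 0, 0], [0, 3.0, 0, 0], [0, 0, 0.5, 0], [0, 0, 0, 0.1]]
    heads = prune_heads(grow_heads(task, threshold=1.0), threshold=1.0)
    print(len(heads), [round(e, 3) for e, _ in heads])
```

In this toy setting the stopping configuration is sufficient in the sense that no residual direction carries energy above the threshold, and minimal in the sense that every retained head carries more than the threshold. How INCRT actually measures these quantities during training is specified in the paper, not here.
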
Problem

Research questions and friction points this paper is trying to address.

Transformer architecture
structural redundancy
attention heads
capacity allocation
task-specific requirements
Innovation

Methods, ideas, or system contributions that make the work stand out.

Incremental Transformer
adaptive architecture
attention head pruning
geometric convergence
parameter efficiency