🤖 AI Summary
To address the challenge of fixed computational budgets in autoregressive sequence modeling—where models struggle to adapt inference cost to varying test-time requirements—this paper introduces AbbIE, an encoder-only, block-based autoregressive iterative encoder. AbbIE recursively refines latent representations of input blocks via iterative encoding; it is trained with only two iterations but supports arbitrary iteration counts at inference time, without requiring additional data, fine-tuning, or specialized training protocols. This design enables fine-grained, on-the-fly trade-offs between computational cost and modeling performance. Evaluated at the 350M-parameter scale, AbbIE achieves up to a 5% reduction in language modeling perplexity and up to a 12% improvement in zero-shot in-context learning accuracy. By decoupling training cost from inference-time flexibility, AbbIE opens a new avenue for scaling Transformer-based architectures.
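The core mechanism described above — a weight-shared encoder applied repeatedly to refine a latent, with the iteration count freely chosen at inference time — can be sketched as follows. This is a conceptual illustration only, not the authors' implementation: `encoder_block`, the residual re-injection of the input, and all dimensions are hypothetical stand-ins for the paper's Transformer encoder blocks.

```python
import numpy as np

rng = np.random.default_rng(0)

def encoder_block(h, W):
    # Stand-in for a Transformer encoder block: one shared linear map
    # plus a tanh nonlinearity (hypothetical simplification).
    return np.tanh(h @ W)

def abbie_refine(x, W, n_iters):
    # Iteratively refine the latent representation of an input block.
    # The same weights W are reused at every iteration, so n_iters can
    # be chosen at inference time independently of training.
    h = x
    for _ in range(n_iters):
        h = encoder_block(h + x, W)  # re-inject the input each pass
    return h

d = 8
W = rng.normal(scale=0.5, size=(d, d))
x = rng.normal(size=(4, d))  # a block of 4 token latents

h_train = abbie_refine(x, W, n_iters=2)  # two iterations, as in training
h_test = abbie_refine(x, W, n_iters=6)   # extra test-time compute, same weights
print(h_train.shape, h_test.shape)
```

Because the weights are shared across iterations, increasing `n_iters` adds compute without adding parameters — this is what lets the model trade inference cost for quality on the fly.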
📝 Abstract
We introduce the Autoregressive Block-Based Iterative Encoder (AbbIE), a novel recursive generalization of the encoder-only Transformer architecture, which achieves better perplexity than a standard Transformer and allows for the dynamic scaling of compute resources at test time. This simple, recursive approach complements scaling large language model (LLM) performance through parameter and token counts. AbbIE performs its iterations in latent space but, unlike latent reasoning models, does not require a specialized dataset or training protocol. We show that AbbIE upward generalizes (i.e., generalizes to arbitrary iteration counts) at test time while using only 2 iterations during training, far outperforming alternative iterative methods. AbbIE's ability to scale its computational expenditure to the complexity of the task yields up to a **12%** improvement in zero-shot in-context learning tasks versus other iterative and standard methods, and up to a 5% improvement in language modeling perplexity. The results from this study open a new avenue for Transformer performance scaling. We perform all of our evaluations on model sizes up to 350M parameters.