🤖 AI Summary
Existing acceleration methods for autoregressive Transformers typically improve inference efficiency by reducing per-token computation, often at the cost of generation quality. This work proposes N-vium, a mixture-of-exits Transformer that partially parallelizes computation across depth on standard hardware, raising effective FLOPs per second rather than cutting compute per token, and thereby accelerating inference without degrading model quality. Its core components are prediction heads attached at multiple depths, whose outputs are combined into a learned mixture that defines the next-token distribution, and token-adaptive routing over those heads. The formulation strictly generalizes the standard Transformer while allowing exact sampling and recovery of complete KV caches. Pretrained at scales up to 1.5B parameters, the largest N-vium model achieves a 57.9% wall-clock speedup over a parameter- and data-matched standard Transformer at no perplexity cost.
📝 Abstract
Improving the inference efficiency of autoregressive transformers typically means reducing FLOPs per token, usually through approximations that degrade model quality. We introduce N-vium, a mixture-of-exits transformer that partially parallelizes computation across depth on standard hardware, increasing effective FLOPs per second rather than minimizing compute per token. N-vium attaches prediction heads at multiple depths and defines the next-token distribution as a learned mixture over these exits, with token-adaptive routing. This formulation strictly generalizes the standard transformer, which is recovered exactly when routing assigns zero mass to all intermediate heads. Sampling from the mixture is exact, and complete KV caches are recovered by deferring the upper-layer computation and batching it with later tokens. We pretrain N-vium at scales up to 1.5B parameters. Our largest model reaches 57.9% wall-clock speedup over a parameter- and data-matched standard transformer at no perplexity cost.
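The mixture-over-exits formulation can be illustrated with a small sketch. It assumes the routing weights are a softmax over per-token router logits and that exact sampling is done ancestrally (draw an exit, then draw a token from that exit's head); the abstract does not specify either detail, so both are assumptions. The names `mixture_distribution`, `sample_token`, `exit_logits`, and `router_logits` are hypothetical, and the sketch does not model depth-wise parallelism or the deferred upper-layer KV computation.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def mixture_distribution(exit_logits, router_logits):
    """Next-token distribution as a mixture over exit heads.

    exit_logits:   one list of vocabulary logits per exit (last entry = final layer).
    router_logits: one score per exit for the current token (assumed softmax routing).
    Returns (routing weights, per-exit distributions, mixture distribution).
    """
    weights = softmax(router_logits)
    heads = [softmax(logits) for logits in exit_logits]
    vocab_size = len(heads[0])
    mix = [sum(w * h[v] for w, h in zip(weights, heads)) for v in range(vocab_size)]
    return weights, heads, mix


def sample_token(exit_logits, router_logits, rng):
    """Ancestral sampling: pick exit k with probability w_k, then a token from head k.

    The marginal distribution of the returned token equals the mixture,
    so no approximation is introduced by sampling this way.
    """
    weights, heads, _ = mixture_distribution(exit_logits, router_logits)
    k = rng.choices(range(len(weights)), weights=weights)[0]
    token = rng.choices(range(len(heads[k])), weights=heads[k])[0]
    return k, token


if __name__ == "__main__":
    # Toy example: three exits over a three-token vocabulary (made-up numbers).
    exit_logits = [[2.0, 0.0, -1.0], [0.5, 0.5, 0.5], [-1.0, 3.0, 0.0]]
    rng = random.Random(0)

    weights, _, mix = mixture_distribution(exit_logits, [0.1, -0.3, 1.2])
    print("routing weights:", weights)
    print("mixture:", mix)
    print("sample (exit, token):", sample_token(exit_logits, [0.1, -0.3, 1.2], rng))

    # Zero routing mass on intermediate heads recovers the final-layer distribution.
    ninf = float("-inf")
    _, heads, mix_std = mixture_distribution(exit_logits, [ninf, ninf, 0.0])
    print("matches final head:", mix_std == heads[-1])
```

Under these assumptions, a token whose sampled exit is an intermediate head could be emitted before the upper layers finish, which is where the deferred, batched upper-layer computation described in the abstract would come in.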