AI Summary
This work addresses an inefficiency of existing recurrent or iterative large language models: they run a fixed number of iterations, so they cannot allocate computation according to the difficulty of individual tokens and waste compute on easy ones. To overcome this limitation, the authors propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early-exit strategies during pretraining. Its key components are iteration-specific MLP gates, a monotonic halting mask, and KV cache reuse for halted tokens, which together give adaptive computation that is consistent between training and inference without manually tuned per-token or per-layer pruning ratios. Evaluated across the Pythia model family (70M to 2.8B parameters), AdaPonderLM reduces inference FLOPs by approximately 10% while maintaining comparable perplexity and downstream task performance; under identical computational budgets, its learned halting policy consistently outperforms fixed pruning baselines.
Abstract
Test-time scaling via recurrent/iterative Transformers enables large language models to spend more computation at inference, but most pretrained recurrent LMs run a fixed number of iterations, wasting compute on easy tokens and lacking token-wise adaptivity. Following the core idea of Adaptive Computation Time (ACT) and Early Exit (EE), we propose AdaPonderLM, a self-supervised recurrent language model that learns token-wise early exiting during pretraining without manually tuned per-token/per-layer pruning ratios. AdaPonderLM uses iteration-specific MLP gates with a monotonic halting mask to decide when each token stops recurring, and introduces a KV reuse mechanism that reuses cached key/value states for halted tokens, ensuring train-test consistency and practical acceleration. Across Pythia backbones from 70M to 410M (pretraining) and up to 2.8B (continued pretraining), AdaPonderLM reduces inference compute by about 10% while maintaining comparable language modeling perplexity and competitive downstream accuracy. Our analysis shows that the learned gates allocate more computation to high-NLL (hard) tokens, exhibiting adaptive computation time behavior in a fully self-supervised setting. Meanwhile, under iso-FLOPs, the learned halting policy consistently outperforms fixed pruning, showing that AdaPonderLM allocates compute to the right tokens rather than just reducing average depth.
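The monotonic halting mask described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the 0.5 threshold, the convention that each gate outputs a "continue" probability, and the toy `step` function are all assumptions made for illustration. The sketch shows only the two properties the abstract names: a token that halts never resumes, and a halted token keeps its last state (a stand-in for reusing its cached key/value states).

```python
from typing import Callable, List


def monotonic_halting_mask(
    continue_probs: List[List[float]], threshold: float = 0.5
) -> List[List[bool]]:
    """Build a per-iteration activity mask.

    continue_probs[t][i] is a hypothetical gate output for token i at
    iteration t (assumed here to be a probability of continuing).
    Returns active[t][i], True if token i is still computed at iteration t.
    """
    num_tokens = len(continue_probs[0])
    alive = [True] * num_tokens  # every token runs the first iteration
    active = []
    for probs in continue_probs:
        active.append(list(alive))
        # Cumulative AND: once a token falls below the threshold it stays
        # halted, even if a later gate would have voted to continue.
        alive = [a and (p >= threshold) for a, p in zip(alive, probs)]
    return active


def depth_per_token(active: List[List[bool]]) -> List[int]:
    """Number of iterations each token actually ran."""
    return [sum(step[i] for step in active) for i in range(len(active[0]))]


def run_recurrence(
    states: List[float], active: List[List[bool]], step: Callable[[float], float]
) -> List[float]:
    """Apply `step` only to active tokens; halted tokens keep their last
    state, a toy stand-in for reusing cached key/value states."""
    for mask in active:
        states = [step(s) if m else s for s, m in zip(states, mask)]
    return states


# Three iterations, three tokens (made-up gate values).
probs = [
    [0.9, 0.2, 0.8],
    [0.7, 0.9, 0.3],  # token 1 votes "continue" here but has already halted
    [0.6, 0.9, 0.9],
]
mask = monotonic_halting_mask(probs)
depths = depth_per_token(mask)
compute_fraction = sum(depths) / (len(probs) * len(probs[0]))
final_states = run_recurrence([0.0, 0.0, 0.0], mask, lambda s: s + 1.0)
```

In this toy run, token 1 halts after the first iteration and stays halted even though its later gate values are high, which is what makes the mask monotonic. The `compute_fraction` value is the share of token-iterations actually executed relative to a fixed-depth run, the kind of quantity an iso-FLOPs comparison would hold equal across methods.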