🤖 AI Summary
The authors argue that standard next-token prediction leaves the representation space under-constrained, allowing hidden states to become degenerate and anisotropic in ways that can limit generalization. This work proposes Next Implicit Token Prediction (NITP), which uses continuous representations from the model's own shallow layers as self-supervised targets, imposing dense semantic supervision on deeper representations. NITP thereby regularizes training without adding inference cost and encourages a more compact, structured representation geometry. The method applies to both dense and Mixture-of-Experts (MoE) architectures and consistently improves downstream performance across models ranging from 0.5B to 9B parameters. Notably, a 9B MoE model gains 5.7%, 6.4%, and 4.3% on MMLU-Pro, C3, and CommonsenseQA, respectively, with only approximately 2% additional training FLOPs.
📝 Abstract
Standard next-token prediction (NTP) supervises language models solely through discrete labels in the output logit space. We argue that this sparse one-hot supervision leaves the latent representation space under-constrained, allowing hidden states to drift into degenerate and anisotropic configurations that can limit generalization. To address this issue, we propose Next Implicit Token Prediction (NITP), which augments discrete prediction with dense continuous supervision directly in the representation space. NITP trains the model to predict the implicit semantic content of the next token, using shallow-layer representations from the same model as stable self-supervised targets. We provide theoretical analysis showing that NITP regularizes the optimization landscape by mitigating under-constrained degrees of freedom and encouraging a compact, structured representation geometry. Empirically, across dense and MoE models ranging from 0.5B to 9B parameters, NITP consistently improves downstream performance with negligible computational overhead. On a 9B MoE model, NITP achieves a 5.7% absolute improvement on MMLU-Pro, along with gains of 6.4% on C3 and 4.3% on CommonsenseQA, with approximately 2% additional training FLOPs and no additional inference cost. Our implementation is available at https://github.com/aHapBean/NITP.
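To make the idea of combining discrete and continuous supervision concrete, the following is a minimal sketch, not the authors' implementation. It assumes the auxiliary term is a cosine distance between the deep-layer state at position t and the shallow-layer state of token t+1, added to the usual cross-entropy with a fixed weight. The function names, the choice of cosine distance, and the `aux_weight` value are all hypothetical; the abstract does not specify the loss form, any projection head, or which layers are used.

```python
import math


def cross_entropy(logits, target_index):
    """Standard NTP loss for one position: negative log-softmax at the true token."""
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return log_z - logits[target_index]


def cosine_distance(u, v):
    """1 - cosine similarity; an assumed choice of representation-space loss."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v + 1e-12)


def nitp_style_loss(logits, token_ids, deep_states, shallow_states, aux_weight=0.1):
    """Combine discrete next-token loss with a dense representation-space term.

    logits[t]         : output logits at position t (predicting token t+1)
    token_ids[t]      : token index at position t
    deep_states[t]    : deep-layer hidden state at position t
    shallow_states[t] : shallow-layer hidden state at position t, used as a
                        fixed target (in an autodiff framework this target
                        would typically be detached; that is an assumption)
    aux_weight        : hypothetical weighting of the auxiliary term
    """
    steps = len(token_ids) - 1  # the last position has no next token
    ntp = sum(cross_entropy(logits[t], token_ids[t + 1]) for t in range(steps)) / steps
    aux = sum(
        cosine_distance(deep_states[t], shallow_states[t + 1]) for t in range(steps)
    ) / steps
    return ntp + aux_weight * aux, ntp, aux
```

Because the auxiliary term in this sketch only reads hidden states that the forward pass already produces and is dropped at inference, it illustrates why such an objective can add training cost without adding inference cost.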