🤖 AI Summary
Large language models (LLMs) exhibit limited performance on long-context understanding tasks, primarily because standard next-token prediction training treats all tokens uniformly, ignoring the heterogeneous context-length requirements across tokens. To address this, we propose a confidence-difference-driven dynamic token weighting training paradigm: by contrasting the prediction confidence of short- and long-context models on the same token, we construct a fine-grained loss reweighting mechanism. We introduce the first systematic two-stage token weighting framework; empirically demonstrate that lightweight small models can efficiently provide token importance scores for large models; and validate that non-uniform loss weighting yields critical gains for long-range modeling. Our method achieves significant improvements across multiple long-context benchmarks. To facilitate reproducibility, we open-source our implementation and provide comprehensive fine-tuning guidelines.
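To make the confidence-difference idea concrete, here is a minimal NumPy sketch of one plausible two-step weighting scheme: score each token by how much more confident a long-context model is than a short-context model, then turn those scores into normalized loss weights. The function names, the exponential scoring rule, and the `temperature` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def token_weights(long_logprobs, short_logprobs, temperature=1.0):
    """Score tokens by the confidence gap between a long- and a
    short-context model, and convert the scores into loss weights.

    Tokens where the short-context model is much less confident than
    the long-context model presumably depend on distant context, so
    they receive larger weights. (Hypothetical scoring rule; the
    paper's actual weighting functions may differ.)
    """
    diff = long_logprobs - short_logprobs          # confidence difference per token
    w = np.exp(diff / temperature)                 # emphasize long-range tokens
    return w * len(w) / w.sum()                    # normalize: weights average to 1

def weighted_nll(long_logprobs, weights):
    """Weighted negative log-likelihood: the reweighted training loss."""
    return -(weights * long_logprobs).mean()

# Toy example: token 1 gains the most from long context (gap of 2 nats),
# so it dominates the loss.
long_lp = np.array([-0.5, -2.0, -1.0])    # log-probs under the long-context model
short_lp = np.array([-0.6, -4.0, -1.1])   # log-probs under the short-context model
w = token_weights(long_lp, short_lp)
loss = weighted_nll(long_lp, w)
```

With uniform weights this reduces to standard next-token prediction; the normalization keeps the overall loss scale comparable while redistributing emphasis toward context-sensitive tokens.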
📄 Abstract
Many applications of large language models (LLMs) require long-context understanding, but models continue to struggle with such tasks. We hypothesize that conventional next-token prediction training could contribute to this, because each token is assigned equal weight. Yet, intuitively, the amount of context needed to predict the next token accurately varies greatly across different data. To reflect this, we propose various novel token-weighting schemes that assign different weights to each training token in the loss, thereby generalizing existing works. We categorize token-weighting methods using a two-step framework that compares the confidences of a long-context and a short-context model to score tokens. We evaluate all methods on multiple long-context understanding tasks and show that non-uniform loss weights help improve the long-context abilities of LLMs. Different short-context models can be used effectively for token scoring, including models that are much smaller than the long-context model being trained. All in all, this work contributes to a better understanding of the trade-offs long-context language modeling faces and provides empirically grounded guidelines for model steering via loss weighting. The code can be found on GitHub.