🤖 AI Summary
Existing scaling laws for language models rely heavily on cross-entropy loss, which measures only the absolute probability assigned to the correct token and ignores the model's ability to rank it above incorrect candidates, a property critical to practical decoding strategies such as greedy decoding.
Method: This work introduces a novel *relative-ranking perspective* on scaling laws, proposing the Relative-Based Probability (RBP) metric, which quantifies how highly the correct token is ranked within the model's predicted distribution, and deriving the "Relative-Based Scaling Law," which characterizes the power-law relationship between model scale and RBP.
Contribution/Results: The law is empirically validated across four model families, four diverse datasets, and five orders of magnitude in parameter count, demonstrating strong robustness. By shifting focus from absolute likelihood to relative ordering, the framework complements the conventional cross-entropy view, offers a deeper explanation of emergence phenomena in large language models, and supports the search for a unified theoretical foundation for scaling laws.
📝 Abstract
Scaling laws aim to accurately predict model performance across different scales. Existing scaling-law studies rely almost exclusively on cross-entropy as the evaluation metric. However, cross-entropy provides only a partial view of performance: it measures the absolute probability assigned to the correct token but ignores the relative ordering between correct and incorrect tokens. Yet relative ordering is crucial for language models, for example under greedy decoding. To address this limitation, we investigate scaling from the perspective of relative ordering. We first propose the Relative-Based Probability (RBP) metric, which quantifies the probability that the correct token is ranked among the top predictions. Building on this metric, we establish the Relative-Based Scaling Law, which characterizes how RBP improves with increasing model size. Through extensive experiments on four datasets and four model families spanning five orders of magnitude in parameter count, we demonstrate the robustness and accuracy of this law. Finally, we illustrate the broader applications of the law with two examples: providing a deeper explanation of emergence phenomena and facilitating the search for fundamental theories of scaling laws. In summary, the Relative-Based Scaling Law complements the cross-entropy perspective and contributes to a more complete understanding of scaling in large language models, offering valuable insights for both practical development and theoretical exploration.
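To make the RBP idea concrete, here is a minimal sketch of one plausible estimator. The paper's exact definition is not given in this summary, so the function name `rbp_at_k` and the formulation below are assumptions: RBP@k is taken to be the empirical probability, over evaluated positions, that the correct next token falls within the model's top-k predictions.

```python
# Hypothetical RBP@k estimator (a sketch; the paper's exact definition may differ).
# RBP@k here = fraction of positions where the correct token ranks in the top k.

def rbp_at_k(logit_rows, correct_ids, k):
    """Estimate RBP@k from per-position vocabulary scores.

    logit_rows  : list of score lists, one list per position (length = vocab size)
    correct_ids : list of the correct token's index at each position
    k           : rank cutoff (k=1 corresponds to greedy-decoding accuracy)
    """
    hits = 0
    for scores, correct in zip(logit_rows, correct_ids):
        # Rank of the correct token = number of tokens scored strictly higher.
        rank = sum(s > scores[correct] for s in scores)
        if rank < k:
            hits += 1
    return hits / len(logit_rows)

# Toy example: vocabulary of 4 tokens, 3 evaluated positions.
logits = [
    [2.0, 0.5, 0.1, -1.0],  # correct token 0 is ranked 1st
    [0.3, 1.2, 0.9,  0.0],  # correct token 2 is ranked 2nd
    [1.5, 0.2, 0.4,  0.1],  # correct token 3 is ranked 4th
]
correct = [0, 2, 3]
print(rbp_at_k(logits, correct, k=2))  # 2 of 3 positions hit -> 0.666...
```

Note that `rbp_at_k(..., k=1)` coincides with greedy-decoding token accuracy, which is why a rank-based metric connects directly to the decoding behavior the abstract highlights.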