🤖 AI Summary
Existing evaluation metrics for large language model training, such as perplexity, are susceptible to noise in long-context scenarios and correlate weakly with downstream software engineering task performance. To address this, the authors propose the High-Entropy Signal-to-Noise Ratio (HE-SNR), a metric built around a "reasonable hesitation" regime. HE-SNR uses fine-grained entropy analysis to surface latent logical structure during training, and the accompanying entropy compression hypothesis redefines intelligence as the capacity to structurally manage uncertainty rather than merely to compress it into a scalar. Combined with a rigorous data filtering strategy, HE-SNR demonstrates enhanced robustness on industrial-scale Mixture-of-Experts (MoE) models at 32K/128K context lengths and significantly improves predictive accuracy for SWE-bench task performance.
📝 Abstract
SWE-bench has emerged as the premier benchmark for evaluating Large Language Models on complex software engineering tasks. While these capabilities are fundamentally acquired during the mid-training phase and subsequently elicited during Supervised Fine-Tuning (SFT), there remains a critical deficit in metrics capable of guiding mid-training effectively. Standard metrics such as Perplexity (PPL) are compromised by the "Long-Context Tax" and exhibit weak correlation with downstream SWE performance. In this paper, we bridge this gap by first introducing a rigorous data filtering strategy. Crucially, we propose the Entropy Compression Hypothesis, redefining intelligence not by scalar Top-1 compression but by the capacity to structure uncertainty into low-order Entropy-Compressed States ("reasonable hesitation"). Grounded in this fine-grained entropy analysis, we formulate a novel metric, HE-SNR (High-Entropy Signal-to-Noise Ratio). Validated on industrial-scale Mixture-of-Experts (MoE) models across varying context windows (32K/128K), our approach demonstrates superior robustness and predictive power. This work provides both the theoretical foundation and practical tools for unlocking the latent potential of LLMs in complex engineering domains.
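The "fine-grained entropy analysis" mentioned above operates on a model's per-token predictive distributions: a peaked distribution signals confidence, while a flatter one corresponds to the "hesitation" the hypothesis seeks to structure. As a minimal sketch (this is a standard Shannon-entropy computation, not the paper's HE-SNR definition; `token_entropy` is a hypothetical helper name):

```python
import math

def token_entropy(logits):
    """Shannon entropy (in nats) of a next-token distribution given raw logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]   # numerically stable softmax
    z = sum(exps)
    probs = [e / z for e in exps]
    return -sum(p * math.log(p) for p in probs if p > 0.0)

# A peaked (confident) distribution has low entropy ...
confident = token_entropy([10.0, 0.0, 0.0, 0.0])
# ... while a uniform (maximally hesitant) one attains the maximum, ln(vocab size).
hesitant = token_entropy([1.0, 1.0, 1.0, 1.0])
```

A signal-to-noise style metric would then aggregate such per-token entropies over a corpus, separating tokens where hesitation is "reasonable" from those where it is mere noise; the exact weighting is defined in the paper body.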