🤖 AI Summary
This study addresses two limitations of existing Chinese large language model (LLM) evaluations: benchmark saturation and high computational cost. Together these hide the non-uniformity, or anisotropy, of model capabilities across domains. To overcome these issues, the authors propose ReLE, a system that constructs an orthogonal Domain × Capability evaluation matrix spanning 304 models and 207,843 samples. ReLE introduces a symbolic-grounded hybrid scoring mechanism to eliminate embedding-induced false positives in reasoning tasks, and a dynamic variance-aware scheduler based on Neyman allocation that reduces computational cost by 70% relative to full-pass evaluation while maintaining a ranking correlation of ρ = 0.96. Experiments show a Rank Stability Amplitude (RSA) of 11.4 in ReLE, substantially higher than the ~5.0 observed with conventional benchmarks, which the authors take as strong evidence that current Chinese LLMs are specialized rather than superior across the board.
📝 Abstract
Large Language Models (LLMs) have achieved rapid progress in Chinese language understanding, yet accurately evaluating their capabilities remains difficult because of benchmark saturation and prohibitive computational costs. While static leaderboards provide snapshot rankings, they often mask the structural trade-offs between capabilities. In this work, we present ReLE (Robust Efficient Live Evaluation), a scalable system designed to diagnose Capability Anisotropy, the non-uniformity of model performance across domains. Using ReLE, we evaluate 304 models (189 commercial, 115 open-source) across a Domain $\times$ Capability orthogonal matrix comprising 207,843 samples. We introduce two methodological contributions to address current evaluation pitfalls: (1) a Symbolic-Grounded Hybrid Scoring Mechanism that eliminates embedding-based false positives in reasoning tasks; and (2) a Dynamic Variance-Aware Scheduler based on Neyman allocation with noise correction, which reduces compute costs by 70\% compared to full-pass evaluations while maintaining a ranking correlation of $\rho=0.96$. Our analysis reveals that aggregate rankings are highly sensitive to weighting schemes: models exhibit a Rank Stability Amplitude (RSA) of 11.4 in ReLE versus $\sim$5.0 in traditional benchmarks, confirming that modern models are highly specialized rather than generally superior. We position ReLE not as a replacement for comprehensive static benchmarks, but as a high-frequency diagnostic monitor for the evolving model landscape.
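For readers unfamiliar with Neyman allocation, the following is a minimal sketch of the textbook rule the scheduler is described as building on: with a fixed sample budget $n$, stratum $h$ receives $n_h = n \cdot N_h \sigma_h / \sum_k N_k \sigma_k$, so larger and noisier strata are sampled more heavily. It is not the authors' implementation. The function name `neyman_allocation`, the treatment of each Domain $\times$ Capability cell as a stratum, and the example numbers are all assumptions made for illustration, and the noise correction and dynamic re-estimation mentioned in the abstract are not modeled.

```python
from typing import List, Sequence


def neyman_allocation(
    sizes: Sequence[int], stds: Sequence[float], budget: int
) -> List[int]:
    """Split `budget` samples across strata in proportion to N_h * sigma_h.

    sizes  -- number of available samples per stratum (N_h)
    stds   -- estimated per-stratum score standard deviation (sigma_h)
    budget -- total number of samples to evaluate (n)
    """
    if len(sizes) != len(stds):
        raise ValueError("sizes and stds must have the same length")
    if budget < 0 or budget > sum(sizes):
        raise ValueError("budget must be between 0 and the total pool size")

    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    if total == 0:
        # No variance information: fall back to proportional allocation.
        weights = [float(n) for n in sizes]
        total = sum(weights)

    # Ideal (fractional) allocation, floored and capped at the stratum size.
    raw = [budget * w / total for w in weights]
    alloc = [min(int(r), n) for r, n in zip(raw, sizes)]

    # Hand out the remaining samples, largest fractional part first,
    # never exceeding what a stratum actually contains.
    leftover = budget - sum(alloc)
    order = sorted(
        range(len(sizes)), key=lambda h: raw[h] - int(raw[h]), reverse=True
    )
    while leftover > 0:
        for h in order:
            if leftover == 0:
                break
            if alloc[h] < sizes[h]:
                alloc[h] += 1
                leftover -= 1
    return alloc


# Hypothetical example: three equally sized cells with increasing score noise.
example = neyman_allocation(
    sizes=[1000, 1000, 1000], stds=[0.125, 0.25, 0.5], budget=350
)
print(example)
```

In this hypothetical setting the noisiest cell receives four times as many samples as the quietest one. That is the intuition behind a variance-aware scheduler: evaluation effort goes where a model's score is least certain.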