🤖 AI Summary
Existing methods for estimating the parameter counts of closed-source large language models from inference costs are highly sensitive to hardware and deployment variations, leading to substantial inaccuracies. This work proposes the Incompressible Knowledge Probe (IKP) benchmark, which estimates model scale from factual knowledge capacity rather than inference cost. Using a dataset of 1,400 fact-based questions spanning seven levels of obscurity, the authors fit a log-linear relationship between accuracy and parameter count on 89 open-weight models (R² = 0.917) and validate it with leave-one-out cross-validation, in which 68.5% of predictions fall within a factor of two and 87.6% within a factor of three of the true parameter count. In Mixture-of-Experts models, total parameters predict knowledge capacity better than activated parameters alone, and factual knowledge capacity continues to scale log-linearly with parameter count, showing no signs of saturation.
📝 Abstract
Closed-source frontier labs do not disclose parameter counts, and the standard alternative -- inference economics -- carries more than $2\times$ uncertainty from hardware, batching, and serving-stack assumptions external to the model. We exploit a tighter intrinsic bound: storing $F$ facts requires at least $F / (\text{bits per parameter})$ weights, so measuring how much a model \emph{knows} lower-bounds how many parameters it \emph{has}. We introduce \textbf{Incompressible Knowledge Probes (IKPs)}, a benchmark of 1{,}400 factual questions spanning 7 tiers of obscurity, designed to isolate knowledge that cannot be derived by reasoning or compressed by architectural improvements.
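The capacity bound above can be made concrete with a small calculation. The following is a minimal sketch, not the authors' code: the function name `min_parameters`, the explicit `bits_per_fact` argument (added here to make the units visible), and every number in the usage lines are illustrative assumptions that do not come from the paper.

```python
def min_parameters(num_facts: float, bits_per_fact: float, bits_per_param: float) -> float:
    """Lower bound on the number of weights needed to store a set of facts.

    Total information = num_facts * bits_per_fact (in bits).
    Dividing by the storage capacity per weight gives the minimum weight count.
    All three arguments are hypothetical inputs chosen by the caller.
    """
    if bits_per_param <= 0:
        raise ValueError("bits_per_param must be positive")
    return num_facts * bits_per_fact / bits_per_param


# Illustrative numbers only (assumptions, not measurements):
# one billion facts, 50 bits each, 2 bits of capacity per parameter.
bound = min_parameters(num_facts=1e9, bits_per_fact=50, bits_per_param=2)
print(f"at least {bound:.3g} parameters")
```

The bound only runs in one direction: a model can hold more parameters than its measured knowledge requires, but not fewer.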
We calibrate a log-linear mapping from IKP accuracy to parameter count on 89 open-weight models (135M--1{,}600B) spanning 19 vendors, achieving $R^2 = 0.917$; leave-one-out cross-validation confirms generalization (median fold error $1.59\times$, $68.5\%$ within $2\times$ and $87.6\%$ within $3\times$). For Mixture-of-Experts models, total parameters predict knowledge ($R^2 = 0.79$) far better than active parameters ($R^2 = 0.51$). We evaluate 188 models from 27 vendors and estimate effective knowledge capacity for all major proprietary frontier models; for heavily safety-tuned models the estimates are lower bounds, since refusal policy can hide tens of percentage points of "refused but known" capacity.
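A calibration of this kind, together with its leave-one-out check, might look like the sketch below. It assumes the simplest reading of "log-linear": an ordinary least-squares fit of $\log_{10}(\text{parameters})$ against accuracy. The function names are hypothetical, and the data are synthetic stand-ins generated in the script, not the paper's 89 models.

```python
import numpy as np


def fit_log_linear(accuracy, params):
    """Least-squares fit of log10(params) = intercept + slope * accuracy."""
    slope, intercept = np.polyfit(accuracy, np.log10(params), deg=1)
    return slope, intercept


def predict_params(accuracy, slope, intercept):
    """Map an accuracy value back to an estimated parameter count."""
    return 10.0 ** (intercept + slope * np.asarray(accuracy, dtype=float))


def leave_one_out_fold_errors(accuracy, params):
    """For each model, refit without it and return the fold error
    max(pred / true, true / pred), which is always >= 1."""
    accuracy = np.asarray(accuracy, dtype=float)
    params = np.asarray(params, dtype=float)
    errors = np.empty(len(params))
    for i in range(len(params)):
        keep = np.arange(len(params)) != i  # hold out model i
        slope, intercept = fit_log_linear(accuracy[keep], params[keep])
        pred = predict_params(accuracy[i], slope, intercept)
        errors[i] = max(pred / params[i], params[i] / pred)
    return errors


# Synthetic stand-in data (an assumption for illustration only).
rng = np.random.default_rng(0)
log_params = rng.uniform(8.0, 12.0, size=40)
acc = np.clip(0.15 * (log_params - 8.0) + rng.normal(0.0, 0.03, size=40), 0.0, 1.0)
sizes = 10.0 ** log_params

errs = leave_one_out_fold_errors(acc, sizes)
print("median fold error:", np.median(errs))
print("share within 2x:", np.mean(errs <= 2.0))
print("share within 3x:", np.mean(errs <= 3.0))
```

Reporting the fold error as a ratio rather than an absolute difference keeps a 2x miss on a 1B model comparable to a 2x miss on a 1,000B model.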
The widely reported saturation of reasoning benchmarks does not imply the end of scaling. Procedural capability compresses under the "Densing Law," but across 96 dated open-weight models the IKP time coefficient is $-0.0010$/month (95\% CI $[-0.0031, +0.0008]$), which is statistically indistinguishable from zero and rejects the Densing prediction of $+0.0117$/month at $p < 10^{-15}$. Factual capacity continues to scale log-linearly with parameters across generations and across vendors.
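One way to obtain a time coefficient like the one quoted above is to regress accuracy on both model size and release date, then read off the date term. The sketch below assumes the specification accuracy $= b_0 + b_1 \log_{10}(\text{parameters}) + b_2 \cdot \text{months}$ and a normal-approximation confidence interval; the abstract does not state the exact regression used, so both are assumptions, as are the function name and the synthetic data.

```python
import numpy as np


def fit_time_coefficient(accuracy, params, months):
    """OLS fit of accuracy = b0 + b1*log10(params) + b2*months.

    Returns (b2, standard error of b2) using the classical
    homoskedastic covariance estimate.
    """
    y = np.asarray(accuracy, dtype=float)
    X = np.column_stack([
        np.ones(len(y)),
        np.log10(np.asarray(params, dtype=float)),
        np.asarray(months, dtype=float),
    ])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    dof = len(y) - X.shape[1]          # residual degrees of freedom
    sigma2 = resid @ resid / dof       # residual variance estimate
    cov = sigma2 * np.linalg.inv(X.T @ X)
    return float(beta[2]), float(np.sqrt(cov[2, 2]))


# Synthetic stand-in data with no true time effect (illustration only).
rng = np.random.default_rng(1)
n = 96
log_p = rng.uniform(8.0, 12.0, size=n)
release_month = rng.uniform(0.0, 36.0, size=n)
acc = 0.1 + 0.15 * (log_p - 8.0) + rng.normal(0.0, 0.02, size=n)

b2, se = fit_time_coefficient(acc, 10.0 ** log_p, release_month)
ci = (b2 - 1.96 * se, b2 + 1.96 * se)  # normal-approximation 95% CI
z_vs_densing = (b2 - 0.0117) / se      # distance from the +0.0117/month prediction
print("time coefficient per month:", b2)
print("approximate 95% CI:", ci)
print("z-score against +0.0117/month:", z_vs_densing)
```

Including $\log_{10}(\text{parameters})$ as a control matters here: without it, a trend toward larger releases over time would show up as a spurious time effect.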