STAR: Bridging Statistical and Agentic Reasoning for Large Model Performance Prediction

📅 2026-02-12
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of accurately predicting large language model performance under extreme data sparsity, where only one or two observed scores per test model are available, given the prohibitive cost of comprehensive evaluation. The authors propose STAR, a framework that integrates statistical modeling with knowledge-driven agentic reasoning: Constrained Probabilistic Matrix Factorization (CPMF), enriched with semantic features from domain-specific knowledge retrieval, yields initial predictions with uncertainty estimates. A reasoning module guided by Expectation Violation Theory (EVT) then refines these predictions through multi-dimensional correction and credibility-aware aggregation. By unifying statistical expectation with EVT-based reasoning, STAR achieves a 14.46% improvement over the strongest statistical baseline in total score prediction and consistently outperforms existing approaches on both scoring and ranking tasks, while maintaining strong interpretability and robustness.

📝 Abstract
As comprehensive large model evaluation becomes prohibitively expensive, predicting model performance from limited observations has become essential. However, existing statistical methods struggle with pattern shifts, data sparsity, and lack of explanation, while pure LLM methods remain unreliable. We propose STAR, a framework that bridges data-driven STatistical expectations with knowledge-driven Agentic Reasoning. STAR leverages specialized retrievers to gather external knowledge and embeds semantic features into Constrained Probabilistic Matrix Factorization (CPMF) to generate statistical expectations with uncertainty. A reasoning module guided by Expectation Violation Theory (EVT) then refines predictions through intra-family analysis, cross-model comparison, and credibility-aware aggregation, producing adjustments with traceable explanations. Extensive experiments show that STAR consistently outperforms all baselines on both score-based and rank-based metrics, delivering a 14.46% gain in total score over the strongest statistical method under extreme sparsity, with only 1--2 observed scores per test model.
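The statistical half of the pipeline can be sketched as matrix factorization over a sparse model-by-benchmark score matrix: learn low-rank factors from the observed entries, predict the missing ones, and attach a simple uncertainty that grows for models with fewer observations. This is a minimal MAP-style illustration under assumed toy data and hyperparameters, not the paper's actual CPMF (which additionally embeds retrieved semantic features as constraints); the uncertainty proxy here is a crude stand-in for a posterior variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical model-by-benchmark score matrix; NaN marks unobserved scores.
R = np.array([
    [0.80, 0.60, np.nan, 0.70],
    [0.75, np.nan, 0.50, 0.65],
    [np.nan, 0.55, 0.45, 0.60],
    [0.85, 0.65, 0.55, np.nan],
])
mask = ~np.isnan(R)
n_models, n_tasks = R.shape

k = 2      # latent dimension (assumed)
lam = 0.05 # Gaussian-prior regularizer: the MAP view of the probabilistic model
lr = 0.05  # gradient step size

U = 0.1 * rng.standard_normal((n_models, k))
V = 0.1 * rng.standard_normal((n_tasks, k))

for _ in range(2000):
    P = U @ V.T
    E = np.where(mask, R - P, 0.0)   # reconstruction error on observed entries only
    U += lr * (E @ V - lam * U)      # gradient ascent on the log-posterior
    V += lr * (E.T @ U - lam * V)

pred = U @ V.T
# Crude uncertainty proxy: residual spread on observed entries, inflated
# for models with few observations (e.g. only 1-2 scores, the paper's regime).
resid_std = float(np.sqrt(np.mean((R[mask] - pred[mask]) ** 2)))
obs_per_model = mask.sum(axis=1, keepdims=True)
uncertainty = resid_std * np.sqrt(n_tasks / obs_per_model)

print("predicted missing score R[0,2]:", round(pred[0, 2], 3))
print("fit RMSE on observed entries:", round(resid_std, 4))
```

In STAR's framing, `pred` and `uncertainty` would then be handed to the agentic reasoning module, which adjusts entries whose retrieved knowledge violates the statistical expectation.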
Problem

Research questions and friction points this paper is trying to address.

large model performance prediction
data sparsity
statistical reasoning
agentic reasoning
model evaluation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Statistical-Agentic Reasoning
Constrained Probabilistic Matrix Factorization
Expectation Violation Theory
Performance Prediction
Large Language Model Evaluation