🤖 AI Summary
How can NLP models' cross-domain performance be reliably evaluated and ranked without labeled data? This paper studies that question with a two-step evaluation setup: first, large language models (LLMs) serve as error predictors, flagging which predictions from each of four base classifiers are likely wrong; second, ranking reliability is assessed by how well the resulting per-domain accuracy estimates correlate with true accuracy. Experiments on the multi-domain GeoOLID and Amazon Reviews benchmarks (15 domains in total) show that ranking stability and correlation with true accuracy improve markedly when the predicted error distributions match the base model's actual failure modes, and that LLM-based error predictors outperform distribution-shift-based and zero-shot alternatives in robustness and interpretability of cross-domain ranking. Notably, the paper systematically characterizes the applicability boundaries and key determinants of label-free performance estimation, namely error-pattern fidelity and the size of performance differences across domains.
📝 Abstract
Estimating model performance without labels is an important goal for understanding how NLP models generalize. While prior work has proposed measures based on dataset similarity or predicted correctness, it remains unclear when these estimates produce reliable performance rankings across domains. In this paper, we analyze the factors that affect ranking reliability using a two-step evaluation setup with four base classifiers and several large language models as error predictors. Experiments on the GeoOLID and Amazon Reviews datasets, spanning 15 domains, show that large language model-based error predictors produce stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines. Our analysis reveals two key findings: ranking is more reliable when performance differences across domains are larger, and when the error model's predictions align with the base model's true failure patterns. These results clarify when performance estimation methods can be trusted and provide guidance for their use in cross-domain model evaluation.
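The two-step setup above can be sketched in a few lines: an error predictor flags likely mistakes on unlabeled data to produce a per-domain accuracy estimate, and ranking reliability is then the rank correlation between estimated and true accuracy across domains. The snippet below is a minimal illustration with made-up data; the domain names, error flags, and true accuracies are hypothetical, and the paper's actual error predictors are LLMs judging base-classifier outputs.

```python
# Hedged sketch of the two-step label-free ranking evaluation.
# All data here are hypothetical, for illustration only.

def estimated_accuracy(error_flags):
    """Step 1: an error predictor marks each unlabeled example as a
    likely error (1) or not (0); estimated accuracy = 1 - mean flag."""
    return 1.0 - sum(error_flags) / len(error_flags)

def spearman_rho(xs, ys):
    """Step 2: Spearman rank correlation between estimated and true
    per-domain accuracies (assumes no ties, for brevity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-example error flags from an error predictor,
# and the (normally hidden) true accuracy for three domains.
predicted_errors = {
    "domain_a": [0, 0, 1, 0, 0],
    "domain_b": [1, 0, 1, 1, 0],
    "domain_c": [0, 1, 0, 0, 1],
}
true_acc = {"domain_a": 0.82, "domain_b": 0.55, "domain_c": 0.70}

domains = sorted(predicted_errors)
est = [estimated_accuracy(predicted_errors[d]) for d in domains]
tru = [true_acc[d] for d in domains]
print(round(spearman_rho(est, tru), 3))  # prints 1.0: rankings agree exactly
```

A correlation near 1 means the label-free estimates order the domains the same way the hidden true accuracies do, which is exactly the reliability criterion the abstract evaluates.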