🤖 AI Summary
How can NLP models' cross-domain performance be reliably evaluated and ranked without labeled data? This paper studies that question with a two-step evaluation setup: first, large language models (LLMs) serve as error predictors, flagging which predictions from each of four base classifiers are likely wrong; second, ranking reliability is assessed by how well the resulting per-domain accuracy estimates correlate with true accuracy. Experiments on the multi-domain GeoOLID and Amazon Reviews benchmarks (15 domains in total) show that ranking stability and correlation with true accuracy improve markedly when the predicted error distributions match the base model's actual failure modes, and that LLM-based error predictors outperform distribution-shift-based and zero-shot alternatives in robustness and interpretability of cross-domain ranking. Notably, the paper systematically characterizes the applicability boundaries and key determinants of label-free performance estimation, namely error-pattern fidelity and the size of performance differences across domains.
📝 Abstract
Estimating model performance without labels is an important goal for understanding how NLP models generalize. While prior work has proposed measures based on dataset similarity or predicted correctness, it remains unclear when these estimates produce reliable performance rankings across domains. In this paper, we analyze the factors that affect ranking reliability using a two-step evaluation setup with four base classifiers and several large language models as error predictors. Experiments on the GeoOLID and Amazon Reviews datasets, spanning 15 domains, show that large language model-based error predictors produce stronger and more consistent rank correlations with true accuracy than drift-based or zero-shot baselines. Our analysis reveals two key findings: ranking is more reliable when performance differences across domains are larger, and when the error model's predictions align with the base model's true failure patterns. These results clarify when performance estimation methods can be trusted and provide guidance for their use in cross-domain model evaluation.
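The two-step setup above can be sketched in a few lines: an error predictor flags likely mistakes on unlabeled data to produce a per-domain accuracy estimate, and ranking reliability is then the rank correlation between estimated and true accuracy across domains. The snippet below is a minimal illustration with made-up data; the domain names, error flags, and true accuracies are hypothetical, and the paper's actual error predictors are LLMs judging base-classifier outputs.

```python
# Hedged sketch of the two-step label-free ranking evaluation.
# All data here are hypothetical, for illustration only.

def estimated_accuracy(error_flags):
    """Step 1: an error predictor marks each unlabeled example as a
    likely error (1) or not (0); estimated accuracy = 1 - mean flag."""
    return 1.0 - sum(error_flags) / len(error_flags)

def spearman_rho(xs, ys):
    """Step 2: Spearman rank correlation between estimated and true
    per-domain accuracies (assumes no ties, for brevity)."""
    def ranks(v):
        order = sorted(range(len(v)), key=lambda i: v[i])
        r = [0] * len(v)
        for rank, i in enumerate(order):
            r[i] = rank
        return r
    rx, ry = ranks(xs), ranks(ys)
    n = len(xs)
    d2 = sum((a - b) ** 2 for a, b in zip(rx, ry))
    return 1 - 6 * d2 / (n * (n ** 2 - 1))

# Hypothetical per-example error flags from an error predictor,
# and the (normally hidden) true accuracy for three domains.
predicted_errors = {
    "domain_a": [0, 0, 1, 0, 0],
    "domain_b": [1, 0, 1, 1, 0],
    "domain_c": [0, 1, 0, 0, 1],
}
true_acc = {"domain_a": 0.82, "domain_b": 0.55, "domain_c": 0.70}

domains = sorted(predicted_errors)
est = [estimated_accuracy(predicted_errors[d]) for d in domains]
tru = [true_acc[d] for d in domains]
print(round(spearman_rho(est, tru), 3))  # prints 1.0: rankings agree exactly
```

A correlation near 1 means the label-free estimates order the domains the same way the hidden true accuracies do, which is exactly the reliability criterion the abstract evaluates.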