🤖 AI Summary
NLP model evaluation faces critical challenges, including benchmark saturation, data contamination, and uneven quality across test instances. To address these, we propose SMART, a filtering framework that systematically combines three criteria—triviality removal, contamination detection, and embedding-space diversity constraints—to automatically construct a compact, high-information, high-challenge, low-redundancy subset of an existing benchmark. Specifically, SMART identifies trivial instances via prediction-confidence thresholds, detects contamination through training-set overlap analysis, and enforces semantic diversity via clustering in a pretrained embedding space. Evaluated on three multiple-choice QA benchmarks, SMART reduces dataset size by 48% on average. The filtered subsets attain higher Pearson correlation with human evaluations (e.g., Chatbot Arena) than the unfiltered benchmarks, while preserving relative model rankings. This improves evaluation reliability, computational efficiency, and robustness against dataset artifacts.
📝 Abstract
One of the most challenging problems facing NLP today is evaluation. Some of the most pressing issues pertain to benchmark saturation, data contamination, and diversity in the quality of test examples. To address these concerns, we propose Selection Methodology for Accurate, Reduced, and Targeted (SMART) filtering, a novel approach to selecting a high-quality subset of examples from existing benchmark datasets by systematically removing less informative and less challenging examples. Our approach applies three filtering criteria, removing (i) easy examples, (ii) data-contaminated examples, and (iii) examples that are similar to each other based on distance in an embedding space. We demonstrate the effectiveness of SMART on three multiple-choice QA datasets, where our methodology increases efficiency by reducing dataset size by 48% on average, while increasing Pearson correlation with rankings from Chatbot Arena, a more open-ended human evaluation setting. SMART makes evaluation more efficient, whether applied to make new benchmarks more challenging or to revitalize older datasets, while still preserving the relative model rankings.
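The three filtering criteria can be sketched as a simple sequential pipeline. The snippet below is an illustrative simplification, not the paper's implementation: the function name, thresholds (`easy_thresh`, `sim_thresh`), exact-match contamination check, and greedy cosine-similarity deduplication are all assumptions standing in for the paper's confidence analysis, training-set overlap detection, and embedding-space clustering.

```python
import numpy as np

def smart_filter(examples, confidences, train_texts, embeddings,
                 easy_thresh=0.9, sim_thresh=0.95):
    """Toy SMART-style filter: drop easy, contaminated, and redundant examples.

    Thresholds and the exact-match/greedy-dedup heuristics are illustrative
    assumptions, not the method described in the paper.
    """
    train_set = set(train_texts)
    kept, kept_embs = [], []
    for i, ex in enumerate(examples):
        # (i) triviality: skip examples models answer with near-certain confidence
        if confidences[i] >= easy_thresh:
            continue
        # (ii) contamination: skip examples appearing verbatim in training data
        if ex in train_set:
            continue
        # (iii) redundancy: skip examples too close (cosine) to one already kept
        e = embeddings[i] / np.linalg.norm(embeddings[i])
        if any(float(e @ k) >= sim_thresh for k in kept_embs):
            continue
        kept.append(i)
        kept_embs.append(e)
    return kept  # indices of the surviving subset

# Example: q1 is too easy, q3 is contaminated, q4 duplicates q2 in embedding space
subset = smart_filter(
    examples=["q1", "q2", "q3", "q4"],
    confidences=[0.95, 0.50, 0.50, 0.50],
    train_texts=["q3"],
    embeddings=np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.0, 2.0]]),
)
print(subset)  # [1]
```

Real implementations would replace the exact-match check with n-gram overlap and the greedy pass with clustering over pretrained sentence embeddings, but the structure of the pipeline is the same.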