SCARV: Structure-Constrained Aggregation for Stable Sample Ranking in Redundant NLP Datasets

📅 2026-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing sample-level ranking methods for NLP datasets typically score examples pointwise and overlook redundant structure among samples (exact duplicates, near-duplicates, paraphrases), so highly similar examples can receive unstable relative orderings across random seeds. This work proposes SCARV, a modular aggregation framework that runs on top of an existing scoring proxy and integrates redundancy structure into ranking aggregation. It combines robust multi-seed aggregation with a structure-aware aggregation/allocation step over redundancy clusters. Experiments on synthetic and naturally mined (QQP) redundancy show that SCARV substantially improves global and local ranking stability over bare proxy rankings and makes ranking-based decisions, such as subset selection and suspicious-example retrieval, more reproducible. The authors position it as a stability-oriented aggregation layer rather than a universal data selector.
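To make "reproducibility of subset selection" concrete, the sketch below measures how much the top-k subset changes between seeds, using mean pairwise Jaccard overlap. This metric is an illustrative assumption, not necessarily the one used in the paper, and the names `top_k` and `mean_topk_jaccard` are hypothetical.

```python
from itertools import combinations


def top_k(scores, k):
    """Indices of the k highest-scoring samples (ties broken by index)."""
    order = sorted(range(len(scores)), key=lambda i: (-scores[i], i))
    return set(order[:k])


def mean_topk_jaccard(seed_scores, k):
    """Mean pairwise Jaccard overlap of top-k subsets across seeds.

    seed_scores: list of per-seed score lists, one score per sample.
    Assumes at least two seeds and k >= 1. A value of 1.0 means every
    seed selects the same subset; lower values mean less reproducible
    selection. (Assumed illustrative metric, not the paper's.)
    """
    subsets = [top_k(run, k) for run in seed_scores]
    pairs = list(combinations(subsets, 2))
    return sum(len(a & b) / len(a | b) for a, b in pairs) / len(pairs)
```

Computing this once on bare proxy scores and once on aggregated scores would give a rough before/after view of selection stability.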

📝 Abstract

Sample-level rankings are increasingly used in data-centric NLP for analysis, filtering, debugging, and curation, yet existing pipelines typically score training examples pointwise and rank them as if they were independent. This assumption is fragile in the presence of exact duplicates, near-duplicates, paraphrases, and other redundant structure common in NLP corpora, where stochastic training can make highly similar examples receive unstable relative orderings across random seeds. We study stable sample-level ranking under redundancy and propose SCARV, a modular aggregation framework that operates on top of an existing scoring proxy. SCARV combines robust multi-seed aggregation with a structure-aware aggregation/allocation step over redundancy clusters. Across synthetic redundancy, naturally mined QQP redundancy, multiple proxy families, several NLP tasks, and end-to-end DistilBERT fine-tuning, SCARV substantially improves over bare proxy rankings in global and local stability and yields more reproducible ranking-based decisions such as subset selection and suspicious-example retrieval. Our decomposition and compute-aware frontier sharpen the mechanism: robust multi-seed aggregation is the dominant generic stabilizer, while the structure-aware component adds value mainly under low aggregation budgets or when redundancy clusters are informative, naturally occurring, or sufficiently covered. These results position SCARV not as a universal data selector or a universally dominant replacement for seed-only aggregation, but as a stability-oriented aggregation layer for proxy-induced rankings in redundant NLP datasets.
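The two-stage idea in the abstract (robust multi-seed aggregation, then a structure-aware step over redundancy clusters) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the median as the robust aggregator, mean pooling within clusters, the shared score per cluster, and the names `aggregate_scores` and `rank` are all hypothetical choices. Cluster labels are assumed to come from some external redundancy detector.

```python
from collections import defaultdict
from statistics import median


def aggregate_scores(seed_scores, clusters):
    """Return one stabilized score per sample.

    seed_scores: list of per-seed score lists, one score per sample.
    clusters: one redundancy-cluster id per sample (assumed given).
    """
    n = len(clusters)
    # Stage 1 (assumed): median across seeds as the robust aggregator.
    per_sample = [median(run[i] for run in seed_scores) for i in range(n)]

    # Stage 2 (assumed): pool scores within each redundancy cluster and
    # give every member the pooled value, so near-duplicates cannot
    # swap places between runs.
    members = defaultdict(list)
    for i, cluster_id in enumerate(clusters):
        members[cluster_id].append(i)

    stabilized = [0.0] * n
    for indices in members.values():
        pooled = sum(per_sample[i] for i in indices) / len(indices)
        for i in indices:
            stabilized[i] = pooled
    return stabilized


def rank(scores):
    """Sample indices by descending score, ties broken by index."""
    return sorted(range(len(scores)), key=lambda i: (-scores[i], i))
```

For example, with three seeds and samples 0 and 1 marked as duplicates, `rank(aggregate_scores(seed_scores, [0, 0, 1, 2]))` keeps the two duplicates adjacent in the final ordering regardless of which one a given seed happened to score higher.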

Problem

Research questions and friction points this paper is trying to address.

sample ranking

data redundancy

stability

NLP datasets

reproducibility

Innovation

Methods, ideas, or system contributions that make the work stand out.

sample ranking

redundancy

structure-aware aggregation