🤖 AI Summary
Existing data governance methods rely predominantly on coarse-grained sample removal. They fail to eliminate intra-sample redundancy in annotations, which limits vision–language alignment performance. To address this, we propose DataJuicer, the first “juice-extraction” paradigm for fine-grained data governance, which purifies data dynamically at the token level and strengthens alignment. Specifically, the visual branch compresses ViT patches under saliency-map guidance and extracts key objects; the textual branch improves semantic consistency through category-aware caption rewriting; and differentiable token-importance modeling is coupled with cross-modal alignment optimization. Compared to DataSieve, DataJuicer achieves an average 4.2% improvement across image–text retrieval, classification, and dense visual reasoning tasks, and attains a 37% equivalent data compression rate without performance degradation, overcoming the limitations of conventional static filtering driven by scalar scores.
📝 Abstract
Widely observed data scaling laws, in which error falls off as a power of the training size, demonstrate the diminishing returns of unselective data expansion. Hence, data governance is proposed to downsize datasets by pruning non-informative samples. Yet isolating the impact of a specific sample on overall model performance is challenging because of the vast computation required to try out all sample combinations. Current data governors circumvent this complexity by estimating sample contributions through heuristic-derived scalar scores and discarding the low-value samples. Even after thorough sample sieving, retained samples still contain a substantial number of undesired tokens, which leaves room for further compression and purification. In this work, we upgrade data governance from a 'sieving' approach to a 'juicing' one. Instead of scanning for the least-flawed samples, our dual-branch DataJuicer applies finer-grained intra-sample governance. It squeezes out informative tokens and boosts image-text alignment. Specifically, the vision branch retains salient image patches and extracts relevant object classes, while the text branch incorporates these classes to enhance captions. Consequently, DataJuicer yields more refined datasets through finer-grained governance. Extensive experiments across datasets demonstrate that DataJuicer significantly outperforms the existing DataSieve in image-text retrieval, classification, and dense visual reasoning.