🤖 AI Summary
This paper addresses the replicability challenge posed by adaptive data selection strategies in transfer learning, exposing a fundamental trade-off between adaptation efficacy and result consistency under dynamic sample prioritization. We formally define selection sensitivity (Δ_Q) and prove that the probability of replicability failure grows quadratically with Δ_Q but decays exponentially with sample size; furthermore, source-domain pretraining substantially mitigates this risk. Empirical validation on MultiNLI, spanning six mainstream strategies (e.g., gradient-based selection, curriculum learning), confirms the theory: highly adaptive methods improve performance yet incur failure rates above 25%, whereas low-adaptivity strategies keep failure rates below 7%; source pretraining further reduces failure rates by up to 30%. Our core contribution is the first quantitative, empirically verifiable framework for assessing the reliability of adaptive data selection in transfer learning.
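To make the stated scaling concrete, a minimal schematic of the claimed bound is sketched below. This is not the paper's exact statement: the precise definition of Δ_Q, the constants C and c, and the dependence on the threshold τ are illustrative assumptions. Here Q denotes the selection strategy, A the training pipeline, f the performance metric, S ≈ S' a pair of neighboring training sets, and n the sample size.

```latex
% Schematic only: the exact form of \Delta_Q and the constants C, c are
% illustrative assumptions, not the paper's statement.
\[
  \Delta_Q \;=\; \sup_{S \approx S'}
    \bigl\| Q(\cdot \mid S) - Q(\cdot \mid S') \bigr\|_{\mathrm{TV}}
\]
\[
  \Pr\bigl[\, |f(A(S_1)) - f(A(S_2))| > \tau \,\bigr]
    \;\le\; C \,\Delta_Q^{2}\, e^{-c\, n\, \tau^{2}}
\]
% i.e., failure probability grows quadratically in \Delta_Q and decays
% exponentially in the sample size n.
```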
📝 Abstract
The widespread adoption of transfer learning has revolutionized machine learning by enabling efficient adaptation of pre-trained models to new domains. However, the reliability of these adaptations remains poorly understood, particularly when adaptive data selection strategies dynamically prioritize training examples. We present a comprehensive theoretical and empirical analysis of replicability in transfer learning, introducing a mathematical framework that quantifies the fundamental trade-off between adaptation effectiveness and result consistency. Our key contribution is the formalization of selection sensitivity ($\Delta_Q$), a measure that captures how adaptive selection strategies respond to perturbations in the training data. We prove that the replicability failure probability (the likelihood that two independent training runs produce models whose performance differs by more than a threshold) increases quadratically with selection sensitivity while decreasing exponentially with sample size. Through extensive experiments on the MultiNLI corpus using six adaptive selection strategies, ranging from uniform sampling to gradient-based selection, we demonstrate that this theoretical relationship holds precisely in practice. Our results reveal that highly adaptive strategies such as gradient-based selection and curriculum learning achieve superior task performance but suffer from high replicability failure rates, while less adaptive approaches maintain failure rates below 7%. Crucially, we show that source-domain pretraining provides a powerful mitigation mechanism, reducing failure rates by up to 30% while preserving performance gains. These findings establish principled guidelines for practitioners navigating the performance-replicability trade-off and highlight the need for replicability-aware design in modern transfer learning systems.
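As a companion to the definition above, the following minimal Python sketch shows how a replicability failure rate of this kind can be estimated empirically: run the same pipeline twice with independent randomness and count how often the two scores differ by more than a threshold. The `train_and_eval` stub, the seed-pairing scheme, and the threshold value are hypothetical placeholders, not the paper's experimental protocol.

```python
import random

# Minimal sketch of estimating an empirical replicability failure rate:
# run the same pipeline twice with independent seeds and count how often
# the two runs' scores differ by more than a threshold tau.
# NOTE: train_and_eval is a hypothetical stand-in for the full pipeline
# (data selection + fine-tuning + evaluation); it is not from the paper.

def train_and_eval(seed: int) -> float:
    """Hypothetical stub: trains with the given seed, returns accuracy."""
    rng = random.Random(seed)
    return 0.85 + rng.gauss(0.0, 0.01)  # placeholder score

def failure_rate(num_pairs: int, tau: float) -> float:
    """Fraction of independent run pairs whose scores differ by > tau."""
    failures = 0
    for pair in range(num_pairs):
        acc_a = train_and_eval(seed=2 * pair)
        acc_b = train_and_eval(seed=2 * pair + 1)
        if abs(acc_a - acc_b) > tau:
            failures += 1
    return failures / num_pairs

if __name__ == "__main__":
    # tau = 0.02 is an arbitrary illustrative threshold.
    print(f"estimated failure rate: {failure_rate(num_pairs=100, tau=0.02):.2%}")
```

Under the paper's framing, a more adaptive selection strategy (larger $\Delta_Q$) would raise this measured rate, while a larger training set or source-domain pretraining would lower it.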