🤖 AI Summary
In transfer learning, blind adaptation from poorly aligned source to target domains often degrades performance. Existing similarity measures rely solely on feature distribution alignment, neglecting label structure and decision boundary relationships, and thus fail to reliably predict transfer efficacy. To address this, we propose the Cross-Learning Score (CLS), a dataset similarity metric grounded in bidirectional generalization performance. CLS establishes, for the first time, a theoretical link between dataset similarity and the cosine similarity of decision boundaries. It introduces a three-zone classification framework (positive, ambiguous, and negative transfer) that enables principled transferability assessment. Compatible with encoder-head architectures, CLS avoids costly high-dimensional distribution estimation and remains computationally efficient. Extensive experiments on synthetic and real-world benchmarks demonstrate that CLS reliably predicts transfer gain, improving the rigor and robustness of source dataset selection.
📝 Abstract
Transfer learning has become a cornerstone of modern machine learning, as it empowers models to leverage knowledge from related domains to improve learning effectiveness. However, transferring from poorly aligned data can harm rather than help performance, making it crucial to determine whether a transfer will be beneficial before committing to it. This work addresses this challenge by proposing a novel metric that measures dataset similarity and provides quantitative guidance on transferability. Existing methods in the literature largely focus on feature distributions while overlooking label information and predictive relationships, potentially missing critical transferability signals. In contrast, our proposed metric, the Cross-Learning Score (CLS), measures dataset similarity through bidirectional generalization performance between domains: a model trained on one domain is evaluated on the other, and vice versa. We provide a theoretical justification for CLS by establishing its connection to the cosine similarity between the decision boundaries of the target and source datasets. Computationally, CLS is efficient, as it bypasses the expensive distribution estimation required by high-dimensional problems. We further introduce a general framework that categorizes source datasets into positive, ambiguous, or negative transfer zones based on their CLS relative to the baseline error, enabling informed decisions. Additionally, we extend this approach to encoder-head architectures in deep learning to better reflect modern transfer pipelines. Extensive experiments on diverse synthetic and real-world tasks demonstrate that CLS reliably predicts whether transfer will improve or degrade performance, offering a principled tool for guiding data selection in transfer learning.
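The core idea of the abstract, measuring similarity via bidirectional generalization, can be illustrated with a small sketch. This is not the paper's exact definition of CLS (which also relates the score to decision boundaries and baseline error); it is a hedged toy version, assuming the score is the average of the two cross-domain accuracies and standing in a simple nearest-centroid classifier for the encoder-head model. All function names here (`cross_learning_score`, `centroid_fit`, etc.) are illustrative, not the paper's API.

```python
import random

def centroid_fit(X, y):
    """Fit per-class centroids (a toy stand-in for any trained classifier)."""
    centroids = {}
    for label in set(y):
        pts = [x for x, l in zip(X, y) if l == label]
        centroids[label] = [sum(coord) / len(pts) for coord in zip(*pts)]
    return centroids

def centroid_predict(centroids, x):
    """Predict the class whose centroid is nearest in squared distance."""
    return min(centroids,
               key=lambda l: sum((a - b) ** 2 for a, b in zip(centroids[l], x)))

def accuracy(centroids, X, y):
    return sum(centroid_predict(centroids, x) == l for x, l in zip(X, y)) / len(y)

def cross_learning_score(Xs, ys, Xt, yt):
    """Toy CLS: average bidirectional generalization accuracy
    (train on one domain, test on the other, in both directions)."""
    acc_s_to_t = accuracy(centroid_fit(Xs, ys), Xt, yt)
    acc_t_to_s = accuracy(centroid_fit(Xt, yt), Xs, ys)
    return 0.5 * (acc_s_to_t + acc_t_to_s)

# Two-class Gaussian domains; `shift` controls domain misalignment.
random.seed(0)
def make_domain(shift):
    X, y = [], []
    for label, mu in [(0, (0.0 + shift, 0.0)), (1, (2.0 + shift, 2.0))]:
        for _ in range(100):
            X.append([random.gauss(mu[0], 0.5), random.gauss(mu[1], 0.5)])
            y.append(label)
    return X, y

X_tgt, y_tgt = make_domain(0.0)   # target domain
X_src, y_src = make_domain(0.3)   # well-aligned source
X_bad, y_bad = make_domain(4.0)   # badly misaligned source

cls_aligned = cross_learning_score(X_src, y_src, X_tgt, y_tgt)
cls_misaligned = cross_learning_score(X_bad, y_bad, X_tgt, y_tgt)
print(f"aligned source CLS:    {cls_aligned:.2f}")
print(f"misaligned source CLS: {cls_misaligned:.2f}")
```

In this toy setup the well-aligned source yields a score near 1, while the shifted source falls toward chance level, mirroring how a low CLS would flag a source dataset as a risky transfer candidate.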