Scaling Laws for Data-Efficient Visual Transfer Learning

📅 2025-04-17
📈 Citations: 0
Influential citations: 0
🤖 AI Summary
This work addresses two fundamental challenges in data-constrained downstream vision tasks: (1) the poorly understood scaling behavior of visual AI models, and (2) the unclear mechanistic basis for knowledge distillation’s effectiveness. We establish the first scaling law framework tailored to few-shot transfer learning. We propose the “distillation boundary theory,” which formally characterizes the critical threshold at which performance reversal occurs between data-scarce and data-abundant regimes, revealing a phase transition governed by the relative dominance of distillation versus pretraining. Through systematic empirical evaluation across 1K–1M training samples and model sizes ranging from 2.5M to 38M parameters—leveraging error differential curve modeling and distillation/fine-tuning comparative paradigms—we demonstrate the universality of this critical threshold: distillation yields substantial accuracy gains (+12.3%) under low-data regimes, yet is consistently outperformed by end-to-end fine-tuning in high-data settings. Our findings redefine data-efficient scaling laws and provide a theoretical foundation for lightweight deployment and low-resource vision learning.
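
The summary's error-differential analysis can be illustrated with a short sketch: fit a saturating power law to each paradigm's error curve, then locate the sign change of their difference. Everything below (the power-law form, the numbers, and the function names) is an illustrative assumption, not code or data from the paper.

```python
import numpy as np
from scipy.optimize import curve_fit

def power_law_error(D, a, b, c):
    """Saturating power law: test error vs. downstream sample count D."""
    return a * np.power(D, -b) + c

# Illustrative data (NOT from the paper): sample counts spanning 1K-1M
# and hypothetical test errors for each training paradigm.
D = np.array([1e3, 3e3, 1e4, 3e4, 1e5, 3e5, 1e6])
err_kd = np.array([0.42, 0.35, 0.29, 0.25, 0.230, 0.220, 0.215])  # distilled student
err_ft = np.array([0.55, 0.43, 0.32, 0.26, 0.225, 0.205, 0.195])  # end-to-end fine-tuning

p_kd, _ = curve_fit(power_law_error, D, err_kd, p0=(1.0, 0.3, 0.2), maxfev=10_000)
p_ft, _ = curve_fit(power_law_error, D, err_ft, p0=(1.0, 0.3, 0.2), maxfev=10_000)

# Error differential curve: positive where distillation wins, negative where
# fine-tuning wins; the sign change estimates the critical threshold.
grid = np.logspace(3, 6, 2000)
delta = power_law_error(grid, *p_ft) - power_law_error(grid, *p_kd)
crossings = np.where(np.diff(np.sign(delta)) != 0)[0]
if crossings.size:
    print(f"estimated distillation boundary: ~{grid[crossings[0]]:,.0f} samples")
```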

📝 Abstract
Current scaling laws for visual AI models focus predominantly on large-scale pretraining, leaving a critical gap in understanding how performance scales for data-constrained downstream tasks. To address this limitation, this paper establishes the first practical framework for data-efficient scaling laws in visual transfer learning, addressing two fundamental questions: 1) How do scaling behaviors shift when downstream tasks operate with limited data? 2) What governs the efficacy of knowledge distillation under such constraints? Through systematic analysis of vision tasks across data regimes (1K–1M samples), we propose the distillation boundary theory, revealing a critical turning point in distillation efficiency: 1) Distillation superiority: In data-scarce conditions, distilled models significantly outperform their non-distillation counterparts, efficiently leveraging inherited knowledge to compensate for limited training samples. 2) Pre-training dominance: As pre-training data increases beyond a critical threshold, non-distilled models gradually surpass distilled versions, suggesting diminishing returns from knowledge inheritance when sufficient task-specific data becomes available. Empirical validation across various model scales (2.5M to 38M parameters) and data volumes demonstrates these performance inflection points, with error difference curves transitioning from positive to negative values at critical data thresholds, confirming our theoretical predictions. This work redefines scaling laws for data-limited regimes, bridging the knowledge gap between large-scale pretraining and practical downstream adaptation, and addressing a critical barrier to understanding vision model scaling behaviors and to optimizing computational resource allocation.
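For readers unfamiliar with the comparative paradigm, below is a minimal sketch of the standard Hinton-style distillation objective that such distillation-versus-fine-tuning comparisons typically use. The temperature T and mixing weight alpha are illustrative hyperparameters, not values from the paper; setting alpha to 0 recovers plain end-to-end fine-tuning.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      labels: torch.Tensor,
                      T: float = 4.0,
                      alpha: float = 0.7) -> torch.Tensor:
    """Hinton-style KD: temperature-softened KL term against the teacher
    plus a hard cross-entropy term against the labels."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.log_softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
        log_target=True,
    ) * (T * T)  # rescale so soft-target gradients match the hard term
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard
```
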
Problem

Research questions and friction points this paper is trying to address.

Establishes scaling laws for data-constrained visual transfer learning
Analyzes knowledge distillation efficacy under limited downstream data
Identifies critical thresholds for distillation vs. pretraining dominance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Establishes data-efficient scaling laws framework
Proposes the distillation boundary theory (formalized in the sketch after this list)
Validates performance inflection points empirically
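
A hedged formalization of the distillation boundary, assuming saturating power-law error curves (the paper's exact functional forms are not reproduced here):

```latex
% Assumed error curves as a function of downstream data volume D:
\begin{align*}
E_{\mathrm{ft}}(D) &= a\,D^{-b} + c,    && \text{end-to-end fine-tuning}\\
E_{\mathrm{kd}}(D) &= a'\,D^{-b'} + c', && \text{distilled student } (a' < a,\ c' \ge c)\\
\Delta(D) &= E_{\mathrm{ft}}(D) - E_{\mathrm{kd}}(D) && \text{error differential curve}
\end{align*}
% The distillation boundary is the critical volume $D^{\ast}$ with
% $\Delta(D^{\ast}) = 0$: $\Delta(D) > 0$ for $D < D^{\ast}$ (distillation
% superiority) and $\Delta(D) < 0$ for $D > D^{\ast}$ (pre-training dominance).
```
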
👥 Authors
Wenxuan Yang, Fudan University (research area: computer vision)
Qingqu Wei, Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
Chenxi Ma, Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China
Weimin Tan, Fudan University (research areas: computer vision, deep learning, saliency detection, small object detection and recognition)
Bo Yan, Shanghai Key Laboratory of Intelligent Information Processing, School of Computer Science, Fudan University, Shanghai, China