Provably Improving Generalization of Few-Shot Models with Synthetic Data

📅 2025-05-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
Few-shot image classification suffers from both limited labeled data and distributional shift between synthetic and real data, which degrades generalization. The paper establishes the first theoretical model quantifying how synthetic-to-real distribution discrepancy affects generalization error, and proposes a theory-driven synthetic data generation framework. It further designs a joint data partitioning and training algorithm that unifies prototype learning with distributionally robust optimization, embedded within a few-shot meta-training paradigm. Extensive experiments on multiple benchmarks demonstrate significant improvements over state-of-the-art methods, validating that theoretically grounded, distribution-aware synthetic data generation enhances generalization. Core contributions: (1) the first theoretical generalization bound explicitly incorporating synthetic-to-real distribution divergence; and (2) the first end-to-end framework jointly optimizing prototype learning, distributional robustness, and meta-training.
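The paper's joint optimization is not detailed in this summary; as a loose illustration of the distributionally robust ingredient alone, a minimal group-DRO-style reweighting (names and the two-group real/synthetic split are illustrative assumptions, not the authors' algorithm) upweights whichever data group currently incurs the worst loss:

```python
import math

def group_dro_weights(group_losses, step=1.0):
    # Exponentiated-gradient-style update: exponentiate each group's loss,
    # then normalize so the weights form a distribution over groups.
    # Higher-loss groups (e.g. synthetic data under distribution shift)
    # receive proportionally more weight in the next training step.
    scores = [math.exp(step * loss) for loss in group_losses]
    total = sum(scores)
    return [s / total for s in scores]

# Two groups: [real, synthetic]; the synthetic group has higher loss,
# so it ends up with the larger weight.
weights = group_dro_weights([0.4, 1.2])
print(weights)
```

The `step` parameter plays the role of a learning rate on the group distribution; in full group-DRO the weights would be updated iteratively alongside the model parameters rather than recomputed from scratch.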

📝 Abstract
Few-shot image classification remains challenging due to the scarcity of labeled training examples. Augmenting them with synthetic data has emerged as a promising way to alleviate this issue, but models trained on synthetic samples often face performance degradation due to the inherent gap between real and synthetic distributions. To address this limitation, we develop a theoretical framework that quantifies the impact of such distribution discrepancies on supervised learning, specifically in the context of image classification. More importantly, our framework suggests practical ways to generate good synthetic samples and to train a predictor with high generalization ability. Building upon this framework, we propose a novel theory-based algorithm that integrates prototype learning to optimize both data partitioning and model training, effectively bridging the gap between real few-shot data and synthetic data. Extensive experimental results show that our approach outperforms state-of-the-art methods across multiple datasets.
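The abstract does not spell out the prototype-learning step; a minimal sketch of the standard prototypical-network idea it builds on (class means in feature space, nearest-prototype classification) on toy 2-D features, with a real support set augmented by a broader "synthetic" sample to mimic the distribution gap, might look as follows. All names and the toy data are illustrative assumptions:

```python
import numpy as np

def prototypes(features, labels):
    # One prototype per class: the mean embedding of that class's samples.
    classes = np.unique(labels)
    protos = np.stack([features[labels == c].mean(axis=0) for c in classes])
    return classes, protos

def classify(queries, protos, classes):
    # Assign each query to the class of its nearest prototype (Euclidean).
    dists = np.linalg.norm(queries[:, None, :] - protos[None, :, :], axis=-1)
    return classes[dists.argmin(axis=1)]

rng = np.random.default_rng(0)
# 5-shot real support set for two classes centered at (0,0) and (4,4).
real = rng.normal([[0, 0]] * 5 + [[4, 4]] * 5, 0.5)
# Synthetic augmentation: more samples, wider spread (the synthetic-real gap).
synth = rng.normal([[0, 0]] * 20 + [[4, 4]] * 20, 0.7)
features = np.vstack([real, synth])
labels = np.array([0] * 5 + [1] * 5 + [0] * 20 + [1] * 20)

classes, protos = prototypes(features, labels)
queries = rng.normal([[0, 0], [4, 4]], 0.5)
print(classify(queries, protos, classes))
```

The paper's contribution is in how the synthetic samples are generated and weighted so that the resulting prototypes generalize despite the distribution gap; this sketch only shows the unweighted baseline the method refines.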
Problem

Research questions and friction points this paper is trying to address.

Addressing performance degradation in few-shot models due to synthetic-real data gaps
Developing a theoretical framework to quantify distribution discrepancies in supervised learning
Proposing a prototype-based algorithm to optimize data partitioning and model training
Innovation

Methods, ideas, or system contributions that make the work stand out.

Theoretical framework quantifying synthetic data impact
Algorithm integrating prototype learning optimization
Bridging gap between real and synthetic data