🤖 AI Summary
In few-shot transductive learning, existing methods rely on manually tuned statistical hyperparameters—e.g., class-balancing coefficients—whose optimal values vary significantly across datasets and pretrained backbones; validation-based hyperparameter search is both inefficient and non-scalable. This work introduces, for the first time, a “learning-to-optimize” paradigm into few-shot learning, proposing a generalized Expectation-Maximization (EM) unfolding framework: EM iterations are modeled as differentiable neural networks, enabling end-to-end adaptive learning of hyperparameters. The method is compatible with both vision-only and vision-language pretrained models, supporting cross-modal feature adaptation and meta-optimization. On fine-grained image classification benchmarks, it achieves +10% and +7.5% accuracy gains over standard iterative EM for vision-only and vision-language settings, respectively, while substantially reducing manual hyperparameter tuning overhead.
📝 Abstract
Transductive few-shot learning has recently attracted wide attention in computer vision. Yet, current methods introduce key hyper-parameters that control the prediction statistics of the test batches, such as the level of class balance, and these affect performance significantly. Such hyper-parameters are empirically grid-searched over validation data, and their optimal configurations may vary substantially with the target dataset and pre-training model, making such empirical searches both sub-optimal and computationally intractable. In this work, we advocate and introduce the unrolling paradigm, also referred to as "learning to optimize", in the context of few-shot learning, thereby learning a set of optimized hyper-parameters efficiently and effectively. Specifically, we unroll a generalization of the ubiquitous Expectation-Maximization (EM) optimizer into a neural network architecture, mapping each of its iterates to a layer and learning a set of key hyper-parameters over validation data. Our unrolling approach covers various statistical feature distributions and pre-training paradigms, including recent foundational vision-language models and standard vision-only classifiers. We report comprehensive experiments covering a breadth of fine-grained downstream image classification tasks, showing significant gains brought by the proposed unrolled EM algorithm over its iterative variants. The achieved improvements reach up to 10% and 7.5% on vision-only and vision-language benchmarks, respectively.
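To make the unrolling idea concrete, here is a minimal, hypothetical PyTorch sketch (not the paper's actual implementation): each EM iteration becomes a network layer, and a per-layer hyper-parameter (here a single temperature scaling the E-step softmax, standing in for the paper's statistical hyper-parameters) is exposed as a learnable parameter that can be trained end-to-end on validation tasks.

```python
import torch
import torch.nn as nn


class UnrolledEM(nn.Module):
    """Toy unrolled EM classifier for transductive few-shot tasks.

    Each of the `n_iters` EM iterations is mapped to a "layer" with its
    own learnable hyper-parameter (a softmax temperature). This is a
    simplified illustration of the unrolling paradigm, not the paper's
    exact model.
    """

    def __init__(self, n_iters: int = 5):
        super().__init__()
        # One learnable (log-)temperature per unrolled iteration/layer,
        # trainable by backpropagation instead of grid search.
        self.log_temp = nn.Parameter(torch.zeros(n_iters))

    def forward(self, support, support_labels, query):
        # support: (Ns, d) labeled features; query: (Nq, d) unlabeled features.
        n_classes = int(support_labels.max().item()) + 1
        one_hot = nn.functional.one_hot(support_labels, n_classes).float()
        # Initial prototypes: class means of the support set, shape (K, d).
        protos = one_hot.t() @ support / one_hot.sum(0, keepdim=True).t()
        for t in range(self.log_temp.numel()):
            temp = self.log_temp[t].exp()
            # E-step: soft assignments of queries to class prototypes.
            dists = torch.cdist(query, protos) ** 2       # (Nq, K)
            resp = torch.softmax(-temp * dists, dim=1)    # responsibilities
            # M-step: re-estimate prototypes from the support set plus
            # the softly-assigned query points.
            num = one_hot.t() @ support + resp.t() @ query
            den = one_hot.sum(0).unsqueeze(1) + resp.sum(0).unsqueeze(1)
            protos = num / den
        return resp  # final soft predictions for the query batch
```

Because every step is differentiable, `log_temp` can be optimized with a standard cross-entropy loss on held-out validation episodes, replacing the grid search the abstract criticizes.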