Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing adapter-based CLIP fine-tuning methods, which rely on global unimodal features and overlook fine-grained alignment between local image patches and class-related textual prompts. To remedy this, the authors propose a heterogeneous graph supervision framework employed only during training. The approach constructs a teacher model that represents multi-scale image patches and textual prompts as a heterogeneous graph, using a modality-aware graph Transformer and a discriminative node-filtering mechanism to guide lightweight adapters toward learning superior prototype representations. Notably, the method neither modifies the inference pipeline nor incurs additional computational overhead at test time. It achieves new state-of-the-art results across standard few-shot learning benchmarks with 1–16 samples per class, and ablation studies confirm the effectiveness of graph supervision, text guidance, and node filtering.

📝 Abstract
Recent adapter-based CLIP tuning methods (e.g., Tip-Adapter) are strong few-shot learners, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global unimodal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to distill this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes, while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1–16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.
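To make the asymmetric design concrete, the sketch below shows the two halves the abstract describes: a Tip-Adapter-style key-value cache used at inference, and a training-only loss that pulls the cached keys toward prototypes produced by the graph teacher. This is a minimal illustration, not the paper's implementation; `teacher_protos` stands in for the (hypothetical here) output of the Modality-aware Graph Transformer after node filtering, and the `alpha`/`beta` blending hyperparameters follow Tip-Adapter's convention.

```python
import torch
import torch.nn.functional as F

def cache_logits(query, cache_keys, cache_values, clip_logits,
                 alpha=1.0, beta=5.5):
    """Tip-Adapter-style inference: match query features against the
    cached (key, value) support set and blend with zero-shot CLIP logits.
    This path is unchanged at test time -- the graph teacher is discarded."""
    # query: (B, D) L2-normalized image features
    # cache_keys: (N, D) cached support features (the learnable prototypes)
    # cache_values: (N, C) one-hot support labels
    affinity = query @ cache_keys.t()                        # (B, N)
    cache_term = torch.exp(-beta * (1.0 - affinity)) @ cache_values
    return clip_logits + alpha * cache_term                  # (B, C)

def teacher_distill_loss(cache_keys, teacher_protos):
    """Training-only supervision (a sketch of one of the dual objectives):
    align the adapter's cached keys with class features from the graph
    teacher. `teacher_protos` is an assumed placeholder for the MGT output."""
    return F.mse_loss(F.normalize(cache_keys, dim=-1),
                      F.normalize(teacher_protos, dim=-1))
```

Because the teacher only appears inside `teacher_distill_loss`, dropping it after training leaves `cache_logits` byte-identical to plain Tip-Adapter inference, which is how the method achieves zero extra test-time latency or memory.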
Problem

Research questions and friction points this paper is trying to address.

few-shot learning
image-patch-text alignment
heterogeneous graph
CLIP adaptation
fine-grained relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous graph supervision
modality-aware graph transformer
training-only distillation
few-shot adapter learning
cross-modal reasoning