Training-Only Heterogeneous Image-Patch-Text Graph Supervision for Advancing Few-Shot Learning Adapters

📅 2026-03-18
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses a limitation of existing adapter-based CLIP fine-tuning methods, which rely on global unimodal features and overlook fine-grained alignment between local image patches and class-related textual prompts. To remedy this, the authors propose a heterogeneous graph supervision framework employed only during training. The approach constructs a teacher model that represents multi-scale image patches and textual prompts as a heterogeneous graph, using a modality-aware graph Transformer and a discriminative node-filtering mechanism to guide lightweight adapters toward learning superior prototype representations. Notably, the method neither modifies the inference pipeline nor incurs additional computational overhead at test time. It achieves new state-of-the-art results across standard few-shot learning benchmarks with 1–16 samples per class, and ablation studies confirm the effectiveness of graph supervision, text guidance, and node filtering.

📝 Abstract
Recent adapter-based CLIP tuning methods (e.g., Tip-Adapter) are strong few-shot learners, achieving efficiency by caching support features for fast prototype matching. However, these methods rely on global unimodal feature vectors, overlooking fine-grained patch relations and their structural alignment with class text. To bridge this gap without incurring inference costs, we introduce a novel asymmetric training-only framework. Instead of altering the lightweight adapter, we construct a high-capacity auxiliary Heterogeneous Graph Teacher that operates solely during training. This teacher (i) integrates multi-scale visual patches and text prompts into a unified graph, (ii) performs deep cross-modal reasoning via a Modality-aware Graph Transformer (MGT), and (iii) applies discriminative node filtering to extract high-fidelity class features. Crucially, we employ a cache-aware dual-objective strategy to distill this relational knowledge directly into the Tip-Adapter's key-value cache, effectively upgrading the prototypes, while the graph teacher is discarded at test time. Thus, inference remains identical to Tip-Adapter with zero extra latency or memory. Across standard 1–16-shot benchmarks, our method consistently establishes a new state-of-the-art. Ablations confirm that the auxiliary graph supervision, text-guided reasoning, and node filtering are the essential ingredients for robust few-shot adaptation. Code is available at https://github.com/MR-Sherif/TOGA.git.
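To make the asymmetric design concrete, the sketch below shows the two halves the abstract describes: a Tip-Adapter-style key-value cache used at inference, and a training-only loss that pulls the cached keys toward prototypes produced by the graph teacher. This is a minimal illustration, not the paper's implementation; `teacher_protos` stands in for the (hypothetical here) output of the Modality-aware Graph Transformer after node filtering, and the `alpha`/`beta` blending hyperparameters follow Tip-Adapter's convention.

```python
import torch
import torch.nn.functional as F

def cache_logits(query, cache_keys, cache_values, clip_logits,
                 alpha=1.0, beta=5.5):
    """Tip-Adapter-style inference: match query features against the
    cached (key, value) support set and blend with zero-shot CLIP logits.
    This path is unchanged at test time -- the graph teacher is discarded."""
    # query: (B, D) L2-normalized image features
    # cache_keys: (N, D) cached support features (the learnable prototypes)
    # cache_values: (N, C) one-hot support labels
    affinity = query @ cache_keys.t()                        # (B, N)
    cache_term = torch.exp(-beta * (1.0 - affinity)) @ cache_values
    return clip_logits + alpha * cache_term                  # (B, C)

def teacher_distill_loss(cache_keys, teacher_protos):
    """Training-only supervision (a sketch of one of the dual objectives):
    align the adapter's cached keys with class features from the graph
    teacher. `teacher_protos` is an assumed placeholder for the MGT output."""
    return F.mse_loss(F.normalize(cache_keys, dim=-1),
                      F.normalize(teacher_protos, dim=-1))
```

Because the teacher only appears inside `teacher_distill_loss`, dropping it after training leaves `cache_logits` byte-identical to plain Tip-Adapter inference, which is how the method achieves zero extra test-time latency or memory.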
Problem

Research questions and friction points this paper is trying to address.

few-shot learning
image-patch-text alignment
heterogeneous graph
CLIP adaptation
fine-grained relations
Innovation

Methods, ideas, or system contributions that make the work stand out.

heterogeneous graph supervision
modality-aware graph transformer
training-only distillation
few-shot adapter learning
cross-modal reasoning