🤖 AI Summary
This work addresses context discovery in open ad-hoc few-shot categorization: automatically inferring task-relevant contextual cues from a few labeled exemplars and abundant unlabeled data, so that ad-hoc categories can be expanded through semantic extension and vision-driven clustering. Methodologically, the paper injects a small set of learnable context tokens at the input of a frozen CLIP encoder, jointly optimizing CLIP's image-text contrastive alignment and the visual clustering objective of generalized category discovery (GCD). This design yields both interpretability (e.g., saliency localized on hands, faces, or backgrounds depending on the task) and strong cross-task generalization. On the Stanford and CLEVR-4 benchmarks, the method achieves state-of-the-art performance across multiple categorizations, including 87.4% accuracy on novel classes for Stanford Mood, surpassing CLIP and GCD baselines by over 50%, and it produces task-aware saliency maps with explicit semantic grounding.
📝 Abstract
Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories around that context through semantic extension and visual clustering.
Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective.
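The core mechanism can be sketched in a few lines: the only trainable parameters are context tokens prepended to the token sequence of a frozen encoder, trained with a CLIP-style contrastive loss. The sketch below is illustrative, not the authors' code: a tiny frozen transformer stands in for CLIP, the GCD clustering loss is omitted for brevity, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextTokenEncoder(nn.Module):
    """Toy stand-in for a frozen CLIP encoder with learnable context tokens."""

    def __init__(self, dim=32, num_ctx=4):
        super().__init__()
        # Frozen backbone standing in for CLIP's transformer.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze, as with the frozen CLIP in OAK
        # The only trainable parameters: a small set of context tokens.
        self.ctx = nn.Parameter(torch.randn(num_ctx, dim) * 0.02)

    def forward(self, tokens):  # tokens: (batch, seq_len, dim)
        b = tokens.size(0)
        ctx = self.ctx.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([ctx, tokens], dim=1)  # prepend context tokens to the input
        x = self.backbone(x)
        return F.normalize(x.mean(dim=1), dim=-1)  # pooled, unit-norm feature

def contrastive_loss(img_feat, txt_feat, temp=0.07):
    """Symmetric InfoNCE over matched image/text pairs (CLIP-style alignment)."""
    logits = img_feat @ txt_feat.t() / temp
    labels = torch.arange(img_feat.size(0))
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

enc = ContextTokenEncoder()
img_tokens = torch.randn(8, 16, 32)                  # fake image patch tokens
txt_feat = F.normalize(torch.randn(8, 32), dim=-1)   # fake text features
loss = contrastive_loss(enc(img_tokens), txt_feat)
trainable = [n for n, p in enc.named_parameters() if p.requires_grad]
```

Because the backbone is frozen, gradients from both objectives flow only into `ctx`, which is what lets the same pretrained encoder specialize to different contexts (Action, Mood, Location) by swapping token sets.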
On the Stanford and Clevr-4 datasets, OAK achieves state-of-the-art accuracy and concept discovery across multiple categorizations, including 87.4% novel-class accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.