🤖 AI Summary
This work addresses context discovery in open ad-hoc few-shot categorization: automatically inferring task-relevant contextual cues from a few labeled exemplars and abundant unlabeled data, so that ad-hoc categories can be expanded through semantic extension and vision-driven clustering. Methodologically, the paper injects a small set of learnable context tokens at the input of a frozen CLIP encoder, jointly optimizing CLIP's image-text contrastive alignment and the visual clustering objective of generalized category discovery (GCD). This design yields both interpretability (e.g., saliency localized on hands, faces, or backgrounds depending on the task) and strong cross-task generalization. On the Stanford and CLEVR-4 benchmarks, the method achieves state-of-the-art performance across multiple categorizations, including 87.4% accuracy on novel classes for Stanford Mood, surpassing CLIP and GCD baselines by over 50%, and it produces task-aware saliency maps with explicit semantic grounding.
📝 Abstract
Adaptive categorization of visual scenes is essential for AI agents to handle changing tasks. Unlike fixed common categories for plants or animals, ad-hoc categories are created dynamically to serve specific goals. We study open ad-hoc categorization: given a few labeled exemplars and abundant unlabeled data, the goal is to discover the underlying context and to expand ad-hoc categories around that context through semantic extension and visual clustering.
Building on the insight that ad-hoc and common categories rely on similar perceptual mechanisms, we propose OAK, a simple model that introduces a small set of learnable context tokens at the input of a frozen CLIP and optimizes with both CLIP's image-text alignment objective and GCD's visual clustering objective.
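The core mechanism can be sketched in a few lines: the only trainable parameters are context tokens prepended to the token sequence of a frozen encoder, trained with a CLIP-style contrastive loss. The sketch below is illustrative, not the authors' code: a tiny frozen transformer stands in for CLIP, the GCD clustering loss is omitted for brevity, and all names and sizes are assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextTokenEncoder(nn.Module):
    """Toy stand-in for a frozen CLIP encoder with learnable context tokens."""

    def __init__(self, dim=32, num_ctx=4):
        super().__init__()
        # Frozen backbone standing in for CLIP's transformer.
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=2)
        for p in self.backbone.parameters():
            p.requires_grad = False  # freeze, as with the frozen CLIP in OAK
        # The only trainable parameters: a small set of context tokens.
        self.ctx = nn.Parameter(torch.randn(num_ctx, dim) * 0.02)

    def forward(self, tokens):  # tokens: (batch, seq_len, dim)
        b = tokens.size(0)
        ctx = self.ctx.unsqueeze(0).expand(b, -1, -1)
        x = torch.cat([ctx, tokens], dim=1)  # prepend context tokens to the input
        x = self.backbone(x)
        return F.normalize(x.mean(dim=1), dim=-1)  # pooled, unit-norm feature

def contrastive_loss(img_feat, txt_feat, temp=0.07):
    """Symmetric InfoNCE over matched image/text pairs (CLIP-style alignment)."""
    logits = img_feat @ txt_feat.t() / temp
    labels = torch.arange(img_feat.size(0))
    return 0.5 * (F.cross_entropy(logits, labels)
                  + F.cross_entropy(logits.t(), labels))

enc = ContextTokenEncoder()
img_tokens = torch.randn(8, 16, 32)                  # fake image patch tokens
txt_feat = F.normalize(torch.randn(8, 32), dim=-1)   # fake text features
loss = contrastive_loss(enc(img_tokens), txt_feat)
trainable = [n for n, p in enc.named_parameters() if p.requires_grad]
```

Because the backbone is frozen, gradients from both objectives flow only into `ctx`, which is what lets the same pretrained encoder specialize to different contexts (Action, Mood, Location) by swapping token sets.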
On the Stanford and Clevr-4 datasets, OAK achieves state-of-the-art accuracy and concept discovery across multiple categorizations, including 87.4% novel-class accuracy on Stanford Mood, surpassing CLIP and GCD by over 50%. Moreover, OAK produces interpretable saliency maps, focusing on hands for Action, faces for Mood, and backgrounds for Location, promoting transparency and trust while enabling adaptive and generalizable categorization.