TeD-Loc: Text Distillation for Weakly Supervised Object Localization

📅 2025-01-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
Weakly supervised object localization (WSOL) suffers from imprecise localization of complete object regions because it relies on image-level class labels alone. To address this, the paper proposes an end-to-end text-knowledge distillation framework for the vision backbone. For the first time, semantic priors from CLIP's text embeddings are distilled directly into the visual encoder, enabling joint optimization of patch-level localization and classification without external classifiers, ground-truth localization annotations, or generative prompt learning. The method integrates multiple instance learning (MIL) over image patches with cross-modal alignment, achieving roughly 5% improvement in Top-1 localization accuracy on CUB and ILSVRC while exhibiting significantly lower computational complexity than GenPromp. Key contributions: (i) a novel paradigm of text-embedding distillation for WSOL; (ii) synchronous convergence of localization and classification within a single model; and (iii) high accuracy combined with efficiency.

📝 Abstract
Weakly supervised object localization (WSOL) using classification models trained with only image-class labels remains an important challenge in computer vision. Given their reliance on classification objectives, traditional WSOL methods like class activation mapping focus on the most discriminative object parts, often missing the full spatial extent. In contrast, recent WSOL methods based on vision-language models like CLIP require ground truth classes or external classifiers to produce a localization map, limiting their deployment in downstream tasks. Moreover, methods like GenPromp attempt to address these issues but introduce considerable complexity due to their reliance on conditional denoising processes and intricate prompt learning. This paper introduces Text Distillation for Localization (TeD-Loc), an approach that directly distills knowledge from CLIP text embeddings into the model backbone and produces patch-level localization. Multiple instance learning of these image patches allows for accurate localization and classification using one model without requiring external classifiers. Such integration of textual and visual modalities addresses the longstanding challenge of achieving accurate localization and classification concurrently, as WSOL methods in the literature typically converge at different epochs. Extensive experiments show that leveraging text embeddings and localization cues provides a cost-effective WSOL model. TeD-Loc improves Top-1 LOC accuracy over state-of-the-art models by about 5% on both CUB and ILSVRC datasets, while significantly reducing computational complexity compared to GenPromp.
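The core mechanism the abstract describes, scoring each visual patch against CLIP class-text embeddings and aggregating those patch scores with multiple instance learning to get both a localization map and an image-level classification from one model, can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: the function names, the top-k mean pooling, and the use of raw cosine similarity as patch scores are all choices made here for clarity; TeD-Loc's actual training loss and aggregation may differ.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    # Unit-normalize vectors, as CLIP does before computing cosine similarity.
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def patch_text_scores(patch_feats, text_embeds):
    """Cosine similarity between every image patch and every class text embedding.

    patch_feats: (N, D) backbone features for N image patches.
    text_embeds: (C, D) CLIP text embeddings, one per class prompt.
    Returns a (N, C) matrix of patch-level class scores.
    """
    return l2_normalize(patch_feats) @ l2_normalize(text_embeds).T

def mil_classify_and_localize(patch_feats, text_embeds, top_k=4):
    """MIL-style aggregation (illustrative): the image-level score for each
    class is the mean of its top-k patch scores; the localization map for the
    predicted class is simply that class's column of patch scores."""
    scores = patch_text_scores(patch_feats, text_embeds)   # (N, C)
    k = min(top_k, scores.shape[0])
    topk = np.sort(scores, axis=0)[-k:]                    # top-k patches per class
    image_scores = topk.mean(axis=0)                       # (C,) image-level scores
    pred = int(np.argmax(image_scores))                    # classification
    loc_map = scores[:, pred]                              # (N,) patch-level map
    return pred, image_scores, loc_map

# Toy usage: 7x7 grid of patches, 5 classes, 16-dim embeddings.
rng = np.random.default_rng(0)
pred, image_scores, loc_map = mil_classify_and_localize(
    rng.normal(size=(49, 16)), rng.normal(size=(5, 16)))
heatmap = loc_map.reshape(7, 7)  # reshape to the patch grid for visualization
```

Because the same patch-text scores drive both outputs, classification and localization are optimized by one model, which is the property the abstract highlights over CLIP-based methods that need ground-truth classes or an external classifier.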
Problem

Research questions and friction points this paper is trying to address.

Weakly Supervised Object Localization
Computer Vision
Visual Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

TeD-Loc
CLIP Integration
WSOL Efficiency