Seeing What Matters: Empowering CLIP with Patch Generation-to-Selection

📅 2025-03-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
CLIP training incurs substantial computational and memory overhead, and existing masking strategies often degrade image-text alignment by discarding semantic information. To address this, the authors propose the Patch Generation-to-Selection (PGS) framework: a small set of candidate mask patches is first pre-selected; a Sobel edge mask computed over the full image then prioritizes retention of primary object regions; finally, similarity scores between candidate patches and their neighboring patches are refined with optimal transport normalization to yield a balanced similarity matrix for progressive patch selection. PGS preserves edge structure and object-level semantics while significantly reducing computational and memory costs. The resulting model, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and cross-modal retrieval, and further improves robustness to input perturbations and generalization to unseen linguistic compositions under resource-constrained training.

📝 Abstract
The CLIP model has demonstrated significant advancements in aligning visual and language modalities through large-scale pre-training on image-text pairs, enabling strong zero-shot classification and retrieval capabilities on various domains. However, CLIP's training remains computationally intensive, with high demands on both data processing and memory. To address these challenges, recent masking strategies have emerged, focusing on the selective removal of image patches to improve training efficiency. Although effective, these methods often compromise key semantic information, resulting in suboptimal alignment between visual features and text descriptions. In this work, we present a concise yet effective approach called Patch Generation-to-Selection to enhance CLIP's training efficiency while preserving critical semantic content. Our method introduces a gradual masking process in which a small set of candidate patches is first pre-selected as potential mask regions. Then, we apply Sobel edge detection across the entire image to generate an edge mask that prioritizes the retention of the primary object areas. Finally, similarity scores between the candidate mask patches and their neighboring patches are computed, with optimal transport normalization refining the selection process to ensure a balanced similarity matrix. Our approach, CLIP-PGS, sets new state-of-the-art results in zero-shot classification and retrieval tasks, achieving superior performance in robustness evaluation and language compositionality benchmarks.
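The edge-mask step of the abstract can be illustrated with a short sketch. The function name, patch size, and top-level scoring scheme below are assumptions for illustration, not the authors' implementation: it computes a Sobel gradient magnitude over a grayscale image and averages it within non-overlapping patches, so edge-rich (likely object) patches score high and can be prioritized for retention.

```python
import numpy as np

def sobel_patch_scores(gray, patch=16):
    """Score each non-overlapping patch by its mean Sobel gradient
    magnitude (edge density). Hypothetical helper, not from the paper.
    """
    kx = np.array([[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]], dtype=float)
    ky = kx.T
    H, W = gray.shape
    pad = np.pad(gray, 1, mode="edge")  # replicate borders: no fake edges
    gx = np.zeros((H, W))
    gy = np.zeros((H, W))
    for i in range(3):                  # 3x3 correlation with Sobel kernels
        for j in range(3):
            win = pad[i:i + H, j:j + W]
            gx += kx[i, j] * win
            gy += ky[i, j] * win
    mag = np.hypot(gx, gy)              # gradient magnitude
    ph, pw = H // patch, W // patch
    # average edge magnitude inside each patch -> (ph, pw) score grid
    scores = mag[:ph * patch, :pw * patch] \
        .reshape(ph, patch, pw, patch).mean(axis=(1, 3))
    return scores  # high score = edge-rich patch, kept rather than masked
```

Patches whose score falls below a threshold are natural masking candidates, since they carry little structural information.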
Problem

Research questions and friction points this paper is trying to address.

Improve CLIP training efficiency without losing key semantics
Balance patch selection to preserve primary object information
Enhance robustness of zero-shot classification and retrieval performance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Gradual masking process for patch selection
Sobel edge detection to retain object areas
Optimal transport normalization for balanced similarity
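The optimal transport normalization in the last bullet is commonly realized with Sinkhorn-Knopp iterations. The sketch below is a minimal, generic entropic-OT balancing of a similarity matrix with uniform marginals; the function name, `eps`, and iteration count are assumptions, not details from the paper. Balancing prevents a few highly similar patches from dominating the selection scores.

```python
import numpy as np

def sinkhorn_normalize(sim, eps=0.05, iters=300):
    """Sinkhorn-Knopp balancing of a similarity matrix (entropic OT).

    Returns a transport plan whose rows and columns sum to uniform
    marginals, i.e. a balanced version of `sim`. Illustrative sketch.
    """
    K = np.exp(sim / eps)                   # Gibbs kernel from similarities
    n, m = K.shape
    u = np.full(n, 1.0 / n)                 # uniform row marginal
    v = np.full(m, 1.0 / m)                 # uniform column marginal
    a = np.ones(n)
    b = np.ones(m)
    for _ in range(iters):                  # alternate row/column scaling
        a = u / (K @ b)
        b = v / (K.T @ a)
    return a[:, None] * K * b[None, :]      # balanced transport plan
```

Candidate patches whose balanced similarity mass to their neighbors is highest are the most redundant, and so can be masked first in a progressive selection.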
👥 Authors
Gensheng Pei, Nanjing University of Science and Technology
Tao Chen, Nanjing University of Science and Technology
Yujia Wang, Zhejiang Sci-Tech University
Xinhao Cai, Nanjing University of Science and Technology (computer vision, machine learning)
Xiangbo Shu, Nanjing University of Science and Technology
Tianfei Zhou, Beijing Institute of Technology | ETH Zurich (Artificial Intelligence, Medical AI, Computer Vision)
Yazhou Yao, Nanjing University of Science and Technology