TokenCLIP: Token-wise Prompt Learning for Zero-shot Anomaly Detection

📅 2025-10-24
🤖 AI Summary
Existing zero-shot anomaly detection methods rely on a single text embedding space for vision–semantics alignment, limiting their capacity to model diverse, fine-grained anomaly semantics. To address this, we propose a token-wise dynamic alignment framework that decomposes the unified text space into orthogonal subspace components. The method introduces learnable subspaces and a semantic-affinity-driven dynamic allocation mechanism, and combines optimal transport modeling with a top-k sparse masking strategy to enable discriminative, fine-grained cross-modal matching. This design improves both the localization and the identification of unseen anomalous objects. Extensive experiments show state-of-the-art performance on standard benchmarks, including MVTec-AD and VisA, significantly outperforming prior approaches and validating the framework's generalization capability and semantic expressiveness in zero-shot anomaly detection.
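The token-wise cross-modal matching described above builds on CLIP-style similarity scoring, in which each visual patch token is compared against a pair of "normal"/"anomalous" text prompts. The sketch below is a minimal illustration of that general scoring scheme, not the paper's implementation; the function name, prompt setup, and temperature are our assumptions.

```python
import numpy as np

def anomaly_map(tokens, t_normal, t_anom, tau=0.07):
    """Token-wise anomaly scores from CLIP-style similarities (illustrative).

    tokens:   (N, D) L2-normalized visual patch embeddings
    t_normal: (D,)   text embedding for the "normal" prompt
    t_anom:   (D,)   text embedding for the "anomalous" prompt
    tau:      softmax temperature (0.07 is a common CLIP default)
    Returns an (N,) array of per-token anomaly probabilities.
    """
    s_n = tokens @ t_normal / tau      # similarity to the normal prompt
    s_a = tokens @ t_anom / tau        # similarity to the anomalous prompt
    # two-way softmax over the prompt pair, independently per token
    e_n, e_a = np.exp(s_n), np.exp(s_a)
    return e_a / (e_n + e_a)
```

Reshaping the resulting scores back to the patch grid yields a spatial anomaly map for localization.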

📝 Abstract
Adapting CLIP for anomaly detection on unseen objects has shown strong potential in a zero-shot manner. However, existing methods typically rely on a single textual space to align with visual semantics across diverse objects and domains. The indiscriminate alignment hinders the model from accurately capturing varied anomaly semantics. We propose TokenCLIP, a token-wise adaptation framework that enables dynamic alignment between visual and learnable textual spaces for fine-grained anomaly learning. Rather than mapping all visual tokens to a single, token-agnostic textual space, TokenCLIP aligns each token with a customized textual subspace that represents its visual characteristics. Explicitly assigning a unique learnable textual space to each token is computationally intractable and prone to insufficient optimization. We instead expand the token-agnostic textual space into a set of orthogonal subspaces, and then dynamically assign each token to a subspace combination guided by semantic affinity, which jointly supports customized and efficient token-wise adaptation. To this end, we formulate dynamic alignment as an optimal transport problem, where all visual tokens in an image are transported to textual subspaces based on semantic similarity. The transport constraints of OT ensure sufficient optimization across subspaces and encourage them to focus on different semantics. Solving the problem yields a transport plan that adaptively assigns each token to semantically relevant subspaces. A top-k masking is then applied to sparsify the plan and specialize subspaces for distinct visual regions. Extensive experiments demonstrate the superiority of TokenCLIP.
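The abstract's optimal transport formulation, with transport constraints that spread tokens across subspaces, followed by top-k sparsification of the plan, can be illustrated with an entropic (Sinkhorn) solver. This is a minimal sketch under assumed uniform marginals; the solver choice, hyperparameters, and function names are ours, not the paper's.

```python
import numpy as np

def sinkhorn_plan(cost, n_iter=200, eps=0.1):
    """Entropic OT with uniform marginals via Sinkhorn iterations.

    cost: (N, M) cost matrix, e.g. 1 - cosine similarity between
          N visual tokens and M textual subspaces.
    Returns a transport plan P of shape (N, M) whose rows sum to 1/N
    and whose columns sum to 1/M, so every subspace receives mass
    (the "sufficient optimization" constraint in the abstract).
    """
    N, M = cost.shape
    K = np.exp(-cost / eps)      # Gibbs kernel
    r = np.ones(N) / N           # uniform marginal over tokens
    c = np.ones(M) / M           # uniform marginal over subspaces
    v = np.ones(M)
    for _ in range(n_iter):
        u = r / (K @ v)
        v = c / (K.T @ u)
    return u[:, None] * K * v[None, :]

def topk_mask(P, k=2):
    """Keep each token's k largest assignments and renormalize the rows,
    specializing subspaces to distinct visual regions."""
    drop = np.argsort(P, axis=1)[:, :-k]   # indices of the M-k smallest entries
    masked = P.copy()
    np.put_along_axis(masked, drop, 0.0, axis=1)
    return masked / masked.sum(axis=1, keepdims=True)
```

After masking, each token is aligned only with its k most semantically relevant subspaces, giving the sparse token-to-subspace combination the abstract describes.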
Problem

Research questions and friction points this paper is trying to address.

Dynamic alignment between visual tokens and customized textual subspaces for anomaly detection
Overcoming indiscriminate semantic alignment across diverse objects and domains
Enabling fine-grained anomaly learning through token-wise optimal transport adaptation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Token-wise adaptation framework for fine-grained anomaly learning
Dynamic alignment using orthogonal textual subspace combinations
Optimal transport problem formulation for semantic token assignment
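Orthogonality between the learnable textual subspaces, as listed above, is commonly encouraged with a soft regularizer on the subspace bank. The penalty form below is a hypothetical sketch of one standard choice (penalizing off-diagonal Gram entries), not a detail taken from the paper.

```python
import numpy as np

def orthogonality_penalty(W):
    """Soft orthogonality loss for a bank of learnable subspaces (illustrative).

    W: (M, D) matrix whose rows are subspace basis vectors, one per
       textual subspace. Off-diagonal entries of the Gram matrix are
       penalized so subspaces capture complementary semantics.
    """
    Wn = W / np.linalg.norm(W, axis=1, keepdims=True)  # unit-normalize rows
    gram = Wn @ Wn.T                                    # pairwise cosine similarities
    off = gram - np.eye(W.shape[0])                     # zero out the diagonal
    return float(np.sum(off ** 2))
```

Adding this term to the training objective drives the subspaces toward mutual orthogonality without hard constraints.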