AI Summary
In unsupervised anomaly detection, conventional methods struggle to precisely model the normal boundary due to data scarcity and confounding attributes, and fail to identify user-specified anomalies. To address this, we propose a language-guided feature transformation framework, the first to leverage the cross-modal shared embedding space of vision-language models (e.g., CLIP) for unsupervised anomaly detection. Our method employs natural language instructions to drive semantic-level feature recalibration, enabling fine-grained, interpretable, and user-preference-aligned detection. It supports prompt-based feature projection and is plug-and-play compatible with mainstream detectors (e.g., GAN- or reconstruction-based). Extensive experiments on multiple real-world benchmarks demonstrate an average 18.7% improvement in target anomaly recall while maintaining high accuracy on normal samples, validating both effectiveness and generalizability.
Abstract
This paper introduces LAFT, a novel feature transformation method designed to incorporate user knowledge and preferences into anomaly detection using natural language. Accurately modeling the boundary of normality is crucial for distinguishing abnormal data, but this is often challenging due to limited data or the presence of nuisance attributes. While unsupervised methods that rely solely on data without user guidance are common, they may fail to detect anomalies of specific interest. To address this limitation, we propose Language-Assisted Feature Transformation (LAFT), which leverages the shared image-text embedding space of vision-language models to transform visual features according to user-defined requirements. Combined with anomaly detection methods, LAFT effectively aligns visual features with user preferences, allowing anomalies of interest to be detected. Extensive experiments on both toy and real-world datasets validate the effectiveness of our method.
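The abstract describes transforming visual features in a shared image-text embedding space so that only user-named attributes influence detection. One way such a prompt-based projection could work is to span a "concept subspace" with difference vectors of paired text prompts and project image features onto it. The sketch below illustrates that idea with random stand-in embeddings in place of real CLIP text/image encodings; the function name `concept_projection` and the exact mechanism are assumptions for illustration, not the paper's specified implementation.

```python
import numpy as np

def concept_projection(text_pairs):
    """Hypothetical sketch: build a projector onto the subspace spanned
    by prompt-pair difference vectors (e.g., embeddings of
    "a photo of a red object" minus "a photo of a blue object")."""
    # Stack the difference vectors of each prompt pair: shape (k, d).
    diffs = np.stack([a - b for a, b in text_pairs])
    # Orthonormal basis of the concept subspace via reduced QR: Q is (d, k).
    Q, _ = np.linalg.qr(diffs.T)
    # Orthogonal projection matrix onto the subspace: shape (d, d).
    return Q @ Q.T

# Toy example with stand-in 8-d "text embeddings".
rng = np.random.default_rng(0)
pairs = [(rng.normal(size=8), rng.normal(size=8)) for _ in range(2)]
P = concept_projection(pairs)

image_feat = rng.normal(size=8)
kept = P @ image_feat         # component along the user-named concepts
ignored = image_feat - kept   # nuisance component a detector could discard
```

Downstream, an off-the-shelf anomaly scorer (e.g., nearest-neighbor distance) would operate on `kept` rather than the raw feature, which is one way the "plug-and-play" compatibility with existing detectors could be realized.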