🤖 AI Summary
Text anomaly detection lacks standardized benchmarks, hindering rigorous methodological comparison and innovation. To address this, we introduce the first LLM-embedding-based benchmark for text anomaly detection, encompassing multiple models (GloVe, BERT, LLaMA, Mistral, OpenAI) and diverse domains. We systematically evaluate combinations of these embeddings with classical detectors—including KNN and Isolation Forest—under uniform experimental protocols. Key findings: embedding quality is the dominant factor governing detection performance; shallow detectors achieve competitive AUROC/AUPRC scores when paired with high-quality embeddings, matching deeper alternatives; and cross-model performance matrices exhibit low-rank structure, enabling efficient detector selection. All experiments employ comprehensive evaluation metrics (AUROC and AUPRC), and we publicly release the embedded datasets, source code, and evaluation framework. This benchmark establishes a reproducible, extensible foundation for advancing text anomaly detection research.
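The core pipeline — score each text's embedding with a shallow detector, then evaluate with AUROC — can be sketched in a few lines. This is not the benchmark's actual code: random Gaussian vectors stand in for LLM embeddings, the detector is the common k-th-nearest-neighbor distance score (one of the KNN-style shallow detectors mentioned above), and AUROC is computed via the rank-sum formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for precomputed embeddings (in the benchmark these would come
# from models such as BERT or LLaMA): 200 normal points, 20 scattered anomalies.
normal = rng.normal(0.0, 1.0, size=(200, 32))
anomalies = rng.normal(0.0, 4.0, size=(20, 32))
X = np.vstack([normal, anomalies])
y = np.concatenate([np.zeros(200), np.ones(20)])  # 1 = anomaly

def knn_scores(X, k=5):
    """Anomaly score = Euclidean distance to the k-th nearest neighbor."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)
    return d_sorted[:, k]  # index 0 is the distance to self (zero)

def auroc(y, scores):
    """AUROC via the Mann-Whitney U / rank-sum formulation."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = int(y.sum()), int((1 - y).sum())
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

scores = knn_scores(X)
print(f"KNN detector AUROC: {auroc(y, scores):.3f}")
```

Swapping in a different embedding model only changes how `X` is produced; the detector and metric stay fixed, which is what makes the uniform experimental protocol possible.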
📝 Abstract
Text anomaly detection is a critical task in natural language processing (NLP), with applications spanning fraud detection, misinformation identification, spam detection, and content moderation. Despite significant advances in large language models (LLMs) and anomaly detection algorithms, the absence of standardized, comprehensive benchmarks for evaluating anomaly detection methods on text data limits rigorous comparison and the development of innovative approaches. This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection, leveraging embeddings from diverse pre-trained language models across a wide array of text datasets. We systematically evaluate the effectiveness of embedding-based text anomaly detection by incorporating (1) early language models (GloVe, BERT); (2) multiple LLMs (LLaMA-2, LLaMA-3, Mistral, and OpenAI's small, ada, and large embedding models); (3) multi-domain text datasets (news, social media, scientific publications); and (4) comprehensive evaluation metrics (AUROC, AUPRC). Our experiments reveal a critical empirical insight: embedding quality significantly governs anomaly detection efficacy, and deep learning-based approaches demonstrate no performance advantage over conventional shallow algorithms (e.g., KNN, Isolation Forest) when leveraging LLM-derived embeddings. In addition, we observe strong low-rank structure in cross-model performance matrices, which enables an efficient strategy for rapid model (or embedding) evaluation and selection in practical applications. Finally, we open-source our benchmark toolkit, including all embeddings and code, at https://github.com/jicongfan/Text-Anomaly-Detection-Benchmark, providing a foundation for future research on robust and scalable text anomaly detection systems.
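The low-rank observation can be checked directly with an SVD of a detectors-by-embeddings performance matrix. The sketch below uses a synthetic matrix (a hypothetical rank-1 "detector strength × embedding quality" product plus small noise, not the paper's measured results) to show the kind of diagnostic involved: if one singular value carries nearly all the energy, evaluating a single detector per embedding is enough to rank the rest.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical AUROC matrix: rows = 6 detectors, cols = 8 embedding models.
# Built as a rank-1 factor product plus small noise, mimicking the
# low-rank structure reported for the real cross-model matrices.
detector_strength = rng.uniform(0.7, 1.0, size=(6, 1))
embedding_quality = rng.uniform(0.6, 1.0, size=(1, 8))
P = detector_strength @ embedding_quality + 0.01 * rng.normal(size=(6, 8))

# Singular value spectrum: the fraction of squared energy in the top
# singular value measures how close P is to rank 1.
s = np.linalg.svd(P, compute_uv=False)
energy = s[0] ** 2 / np.sum(s ** 2)
print(f"energy in top singular value: {energy:.3f}")
```

When this fraction is close to 1, rows (detectors) are near-proportional, so the embedding ranking obtained from one cheap detector transfers to the others — the efficient selection strategy described above.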