🤖 AI Summary
Text anomaly detection lacks standardized benchmarks, hindering rigorous methodological comparison and innovation. To address this, we introduce the first LLM-embedding-based benchmark for text anomaly detection, encompassing multiple models (GloVe, BERT, LLaMA, Mistral, OpenAI) and diverse domains. We systematically evaluate combinations of these embeddings with classical detectors—including KNN and Isolation Forest—under uniform experimental protocols. Key findings: embedding quality is the dominant factor governing detection performance; shallow detectors achieve competitive AUROC/AUPRC scores when paired with high-quality embeddings, matching deeper alternatives; and cross-model performance matrices exhibit low-rank structure, enabling efficient detector selection. All experiments employ comprehensive evaluation metrics (AUROC and AUPRC), and we publicly release the embedded datasets, source code, and evaluation framework. This benchmark establishes a reproducible, extensible foundation for advancing text anomaly detection research.
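The core pipeline — score each text's embedding with a shallow detector, then evaluate with AUROC — can be sketched in a few lines. This is not the benchmark's actual code: random Gaussian vectors stand in for LLM embeddings, the detector is the common k-th-nearest-neighbor distance score (one of the KNN-style shallow detectors mentioned above), and AUROC is computed via the rank-sum formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-ins for precomputed embeddings (in the benchmark these would come
# from models such as BERT or LLaMA): 200 normal points, 20 scattered anomalies.
normal = rng.normal(0.0, 1.0, size=(200, 32))
anomalies = rng.normal(0.0, 4.0, size=(20, 32))
X = np.vstack([normal, anomalies])
y = np.concatenate([np.zeros(200), np.ones(20)])  # 1 = anomaly

def knn_scores(X, k=5):
    """Anomaly score = Euclidean distance to the k-th nearest neighbor."""
    d = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    d_sorted = np.sort(d, axis=1)
    return d_sorted[:, k]  # index 0 is the distance to self (zero)

def auroc(y, scores):
    """AUROC via the Mann-Whitney U / rank-sum formulation."""
    order = np.argsort(scores)
    ranks = np.empty(len(scores))
    ranks[order] = np.arange(1, len(scores) + 1)
    n_pos, n_neg = int(y.sum()), int((1 - y).sum())
    u = ranks[y == 1].sum() - n_pos * (n_pos + 1) / 2
    return u / (n_pos * n_neg)

scores = knn_scores(X)
print(f"KNN detector AUROC: {auroc(y, scores):.3f}")
```

Swapping in a different embedding model only changes how `X` is produced; the detector and metric stay fixed, which is what makes the uniform experimental protocol possible.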
📝 Abstract
Text anomaly detection is a critical task in natural language processing (NLP), with applications spanning fraud detection, misinformation identification, spam detection, and content moderation. Despite significant advances in large language models (LLMs) and anomaly detection algorithms, the absence of standardized, comprehensive benchmarks for evaluating anomaly detection methods on text data limits rigorous comparison and the development of innovative approaches. This work performs a comprehensive empirical study and introduces a benchmark for text anomaly detection, leveraging embeddings from diverse pre-trained language models across a wide array of text datasets. We systematically evaluate the effectiveness of embedding-based text anomaly detection by incorporating (1) early language models (GloVe, BERT); (2) multiple LLMs (LLaMA-2, LLaMA-3, Mistral, and OpenAI's small, ada, and large embedding models); (3) multi-domain text datasets (news, social media, scientific publications); and (4) comprehensive evaluation metrics (AUROC, AUPRC). Our experiments reveal a critical empirical insight: embedding quality significantly governs anomaly detection efficacy, and deep learning-based approaches demonstrate no performance advantage over conventional shallow algorithms (e.g., KNN, Isolation Forest) when leveraging LLM-derived embeddings. In addition, we observe strong low-rank structure in cross-model performance matrices, which enables an efficient strategy for rapid model (or embedding) evaluation and selection in practical applications. Finally, we open-source our benchmark toolkit, including all embeddings and code, at https://github.com/jicongfan/Text-Anomaly-Detection-Benchmark, providing a foundation for future research on robust and scalable text anomaly detection systems.
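The low-rank observation can be checked directly with an SVD of a detectors-by-embeddings performance matrix. The sketch below uses a synthetic matrix (a hypothetical rank-1 "detector strength × embedding quality" product plus small noise, not the paper's measured results) to show the kind of diagnostic involved: if one singular value carries nearly all the energy, evaluating a single detector per embedding is enough to rank the rest.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical AUROC matrix: rows = 6 detectors, cols = 8 embedding models.
# Built as a rank-1 factor product plus small noise, mimicking the
# low-rank structure reported for the real cross-model matrices.
detector_strength = rng.uniform(0.7, 1.0, size=(6, 1))
embedding_quality = rng.uniform(0.6, 1.0, size=(1, 8))
P = detector_strength @ embedding_quality + 0.01 * rng.normal(size=(6, 8))

# Singular value spectrum: the fraction of squared energy in the top
# singular value measures how close P is to rank 1.
s = np.linalg.svd(P, compute_uv=False)
energy = s[0] ** 2 / np.sum(s ** 2)
print(f"energy in top singular value: {energy:.3f}")
```

When this fraction is close to 1, rows (detectors) are near-proportional, so the embedding ranking obtained from one cheap detector transfers to the others — the efficient selection strategy described above.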