🤖 AI Summary
This work addresses the challenge of real-time detection of behavioral shifts in black-box large language models (LLMs) served via API, where internal model parameters and gradients are inaccessible. We propose a lightweight online change detection method grounded in statistical equivalence testing, specifically the Kolmogorov–Smirnov (KS) test and Maximum Mean Discrepancy (MMD), applied to linguistic and psycholinguistic features of generated text (e.g., lexical frequency, syntactic complexity, readability, sentiment polarity). Crucially, the approach requires no model access, fine-tuning, or architectural modifications. It enables high-frequency, low-overhead monitoring (<0.5 seconds per inference) and extends naturally to detecting prompt injection attacks. Evaluated across five OpenAI models and Llama 3 70B, the method achieves >92% accuracy in identifying behavioral changes induced by version updates or fine-tuning, substantially outperforming existing baselines.
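The MMD side of the test can be sketched with a standard RBF-kernel two-sample estimator plus a permutation test for significance. Everything below is an illustrative stand-alone sketch, not the authors' implementation: the kernel bandwidth `gamma`, the sample sizes, the permutation count, and the synthetic Gaussian "feature" samples are all assumptions made for the demo.

```python
import math
import random

def rbf(x, y, gamma=0.5):
    """RBF kernel on scalar feature values (gamma is an assumed bandwidth)."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=0.5):
    """Biased (V-statistic) estimate of squared MMD between two 1-D samples."""
    n, m = len(xs), len(ys)
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (n * n)
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (m * m)
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy

def mmd_pvalue(xs, ys, n_perm=100, gamma=0.5, seed=1):
    """Permutation test: how often does a random split of the pooled
    samples produce an MMD^2 at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = mmd2(xs, ys, gamma)
    pooled = xs + ys
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(xs)], pooled[len(xs):], gamma) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing, standard for permutation tests

# Synthetic demo: feature values from an unchanged vs. a shifted "model".
random.seed(1)
base = [random.gauss(0.0, 1.0) for _ in range(60)]
shifted = [random.gauss(1.0, 1.0) for _ in range(60)]
```

A small p-value for `base` vs. `shifted` flags a distributional change; in a monitoring loop the feature samples would come from live model outputs rather than Gaussians.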
📝 Abstract
Large Language Models (LLMs) are often provided as a service via an API, making it challenging for developers to detect changes in their behavior. We present an approach to monitoring LLMs for changes by comparing the distributions of linguistic and psycholinguistic features of generated text. Our method uses a statistical test to determine whether the distributions of features from two samples of text are equivalent, allowing developers to identify when an LLM has changed. We demonstrate the effectiveness of our approach using five OpenAI completion models and Meta's Llama 3 70B chat model. Our results show that simple text features coupled with a statistical test can distinguish between language models. We also explore the use of our approach to detect prompt injection attacks. Our work enables frequent LLM change monitoring and avoids computationally expensive benchmark evaluations.
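The abstract's core recipe, extract a simple feature from each generated text and compare the two feature distributions with a statistical test, can be sketched in plain Python with a two-sample KS test. This is a minimal illustration, not the paper's pipeline: mean word length stands in for the richer features the authors use, the samples are synthetic Gaussians, and the asymptotic critical-value constant `c(0.05) ≈ 1.358` is a textbook approximation.

```python
import math
import random
import re

def mean_word_length(text):
    """One illustrative text feature; the paper draws on richer linguistic
    and psycholinguistic features (frequency, readability, sentiment)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return sum(len(w) for w in words) / len(words) if words else 0.0

def ks_statistic(xs, ys):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d

def models_differ(feats_a, feats_b, c_alpha=1.358):
    """Flag a change when D exceeds the asymptotic critical value
    c(alpha) * sqrt((n + m) / (n * m)); c(0.05) is about 1.358."""
    n, m = len(feats_a), len(feats_b)
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(feats_a, feats_b) > threshold

# Synthetic demo: per-response feature values before and after an update.
random.seed(0)
before = [random.gauss(4.7, 0.3) for _ in range(200)]
after = [random.gauss(5.1, 0.3) for _ in range(200)]
```

In a real deployment, `before` and `after` would be feature values computed (via something like `mean_word_length`) from batches of responses to a fixed prompt set, collected at two points in time.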