🤖 AI Summary
This work addresses the challenge of real-time detection of behavioral shifts in black-box large language models (LLMs) served via API, where internal model parameters and gradients are inaccessible. We propose a lightweight online change detection method grounded in statistical equivalence testing, specifically the Kolmogorov–Smirnov (KS) test and Maximum Mean Discrepancy (MMD), applied to linguistic and psycholinguistic features of generated text (e.g., lexical frequency, syntactic complexity, readability, sentiment polarity). Crucially, the approach requires no model access, fine-tuning, or architectural modifications. It enables high-frequency, low-overhead monitoring (<0.5 seconds per inference) and extends naturally to detecting prompt injection attacks. Evaluated across five OpenAI models and Llama 3 70B, the method achieves >92% accuracy in identifying behavioral changes induced by version updates or fine-tuning, substantially outperforming existing baselines.
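The MMD side of the test can be sketched with a standard RBF-kernel two-sample estimator plus a permutation test for significance. Everything below is an illustrative stand-alone sketch, not the authors' implementation: the kernel bandwidth `gamma`, the sample sizes, the permutation count, and the synthetic Gaussian "feature" samples are all assumptions made for the demo.

```python
import math
import random

def rbf(x, y, gamma=0.5):
    """RBF kernel on scalar feature values (gamma is an assumed bandwidth)."""
    return math.exp(-gamma * (x - y) ** 2)

def mmd2(xs, ys, gamma=0.5):
    """Biased (V-statistic) estimate of squared MMD between two 1-D samples."""
    n, m = len(xs), len(ys)
    kxx = sum(rbf(a, b, gamma) for a in xs for b in xs) / (n * n)
    kyy = sum(rbf(a, b, gamma) for a in ys for b in ys) / (m * m)
    kxy = sum(rbf(a, b, gamma) for a in xs for b in ys) / (n * m)
    return kxx + kyy - 2.0 * kxy

def mmd_pvalue(xs, ys, n_perm=100, gamma=0.5, seed=1):
    """Permutation test: how often does a random split of the pooled
    samples produce an MMD^2 at least as large as the observed one?"""
    rng = random.Random(seed)
    observed = mmd2(xs, ys, gamma)
    pooled = xs + ys
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pooled)
        if mmd2(pooled[:len(xs)], pooled[len(xs):], gamma) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)  # add-one smoothing, standard for permutation tests

# Synthetic demo: feature values from an unchanged vs. a shifted "model".
random.seed(1)
base = [random.gauss(0.0, 1.0) for _ in range(60)]
shifted = [random.gauss(1.0, 1.0) for _ in range(60)]
```

A small p-value for `base` vs. `shifted` flags a distributional change; in a monitoring loop the feature samples would come from live model outputs rather than Gaussians.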
📝 Abstract
Large Language Models (LLMs) are often provided as a service via an API, making it challenging for developers to detect changes in their behavior. We present an approach to monitoring LLMs for changes by comparing the distributions of linguistic and psycholinguistic features of generated text. Our method uses a statistical test to determine whether the distributions of features from two samples of text are equivalent, allowing developers to identify when an LLM has changed. We demonstrate the effectiveness of our approach using five OpenAI completion models and Meta's Llama 3 70B chat model. Our results show that simple text features coupled with a statistical test can distinguish between language models. We also explore the use of our approach to detect prompt injection attacks. Our work enables frequent LLM change monitoring and avoids computationally expensive benchmark evaluations.
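The abstract's core recipe, extract a simple feature from each generated text and compare the two feature distributions with a statistical test, can be sketched in plain Python with a two-sample KS test. This is a minimal illustration, not the paper's pipeline: mean word length stands in for the richer features the authors use, the samples are synthetic Gaussians, and the asymptotic critical-value constant `c(0.05) ≈ 1.358` is a textbook approximation.

```python
import math
import random
import re

def mean_word_length(text):
    """One illustrative text feature; the paper draws on richer linguistic
    and psycholinguistic features (frequency, readability, sentiment)."""
    words = re.findall(r"[A-Za-z']+", text.lower())
    return sum(len(w) for w in words) / len(words) if words else 0.0

def ks_statistic(xs, ys):
    """Two-sample KS statistic: maximum gap between the empirical CDFs."""
    xs, ys = sorted(xs), sorted(ys)
    i = j = 0
    d = 0.0
    while i < len(xs) and j < len(ys):
        if xs[i] <= ys[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / len(xs) - j / len(ys)))
    return d

def models_differ(feats_a, feats_b, c_alpha=1.358):
    """Flag a change when D exceeds the asymptotic critical value
    c(alpha) * sqrt((n + m) / (n * m)); c(0.05) is about 1.358."""
    n, m = len(feats_a), len(feats_b)
    threshold = c_alpha * math.sqrt((n + m) / (n * m))
    return ks_statistic(feats_a, feats_b) > threshold

# Synthetic demo: per-response feature values before and after an update.
random.seed(0)
before = [random.gauss(4.7, 0.3) for _ in range(200)]
after = [random.gauss(5.1, 0.3) for _ in range(200)]
```

In a real deployment, `before` and `after` would be feature values computed (via something like `mean_word_length`) from batches of responses to a fixed prompt set, collected at two points in time.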