🤖 AI Summary
This paper introduces Behavioral Shift Auditing (BSA), a framework for continuously monitoring language models (LMs) for unexpected behavioral shifts after fine-tuning or deployment modifications. BSA needs only model generations, with no access to parameters or gradients: it frames the comparison of text generations as a statistical hypothesis test, using the Kolmogorov–Smirnov test with bootstrap resampling to detect distributional shifts in behaviors such as toxicity and translation quality. The test comes with theoretical guarantees on false positive control and a configurable tolerance parameter that adjusts sensitivity to the needs of different applications. In experiments on toxicity and machine translation, BSA reliably detects meaningful behavioral shifts from only hundreds of samples while keeping false positive rates low, offering a lightweight and interpretable approach to continuous auditing of LM behavior.
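The core comparison described above can be illustrated with a minimal sketch. The summary only tells us that generations from the two models are compared via a Kolmogorov–Smirnov test; everything else here is an assumption: the per-generation behavior scores (e.g. toxicity values from an external classifier) are simulated, and the variable names are illustrative, not the paper's API.

```python
import random

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic: the maximum gap
    between the empirical CDFs of the two samples."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        # Advance whichever sample has the smaller next value,
        # tracking the largest CDF gap seen so far.
        if xs[i] <= ys[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

# Hypothetical per-generation behavior scores in [0, 1] (e.g. toxicity),
# simulated here as stand-ins for an external scorer's output.
random.seed(0)
baseline = [random.gauss(0.20, 0.05) for _ in range(500)]  # baseline model
shifted  = [random.gauss(0.35, 0.05) for _ in range(500)]  # model whose behavior drifted
same     = [random.gauss(0.20, 0.05) for _ in range(500)]  # unchanged model

print(f"shifted vs baseline: D = {ks_statistic(baseline, shifted):.3f}")  # large distance
print(f"same    vs baseline: D = {ks_statistic(baseline, same):.3f}")     # small distance
```

A large KS distance between the baseline's and the audited model's score distributions is the signal that behavior has shifted; with only hundreds of scored generations the statistic already separates the two cases clearly.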
📝 Abstract
As language models (LMs) approach human-level performance, a comprehensive understanding of their behavior becomes crucial. This includes evaluating capabilities, biases, task performance, and alignment with societal values. Extensive initial evaluations, including red teaming and diverse benchmarking, can establish a model's behavioral profile. However, subsequent fine-tuning or deployment modifications may alter these behaviors in unintended ways. We present a method for continual Behavioral Shift Auditing (BSA) in LMs. Building on recent work in hypothesis testing, our auditing test detects behavioral shifts solely through model generations. Our test compares model generations from a baseline model to those of the model under scrutiny and provides theoretical guarantees for change detection while controlling false positives. The test features a configurable tolerance parameter that adjusts sensitivity to behavioral changes for different use cases. We evaluate our approach using two case studies: monitoring changes in (a) toxicity and (b) translation performance. We find that the test is able to detect meaningful changes in behavior distributions using just hundreds of examples.
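The configurable tolerance parameter from the abstract can be sketched as a decision rule: estimate the distributional distance via bootstrap resampling and flag a shift only when its lower confidence bound exceeds a tolerance `tol`, so small within-tolerance drifts are deliberately ignored. This is a self-contained sketch under stated assumptions; the function names, defaults, and simulated scores are illustrative, not the paper's actual procedure.

```python
import random

def ks_statistic(xs, ys):
    """Two-sample Kolmogorov-Smirnov statistic (max empirical-CDF gap)."""
    xs, ys = sorted(xs), sorted(ys)
    n, m = len(xs), len(ys)
    i = j = 0
    d = 0.0
    while i < n and j < m:
        if xs[i] <= ys[j]:
            i += 1
        else:
            j += 1
        d = max(d, abs(i / n - j / m))
    return d

def audit(baseline_scores, current_scores, tol=0.2, n_boot=200, alpha=0.05, seed=0):
    """Flag a behavioral shift only when the bootstrap lower confidence
    bound on the KS distance exceeds the tolerance `tol`."""
    rng = random.Random(seed)
    boots = []
    for _ in range(n_boot):
        # Resample both score sets with replacement and recompute the distance.
        b = rng.choices(baseline_scores, k=len(baseline_scores))
        c = rng.choices(current_scores, k=len(current_scores))
        boots.append(ks_statistic(b, c))
    boots.sort()
    lower = boots[int(alpha * n_boot)]  # empirical alpha-quantile as lower bound
    return lower > tol

# Simulated behavior scores for a baseline, a drifted, and a stable model.
random.seed(1)
baseline = [random.gauss(0.20, 0.05) for _ in range(400)]
drifted  = [random.gauss(0.40, 0.05) for _ in range(400)]
stable   = [random.gauss(0.21, 0.05) for _ in range(400)]

print(audit(baseline, drifted))  # shift well beyond tolerance
print(audit(baseline, stable))   # drift within tolerance
```

Raising `tol` makes the audit more permissive (useful when small behavioral changes are expected and acceptable), while lowering it makes the test more sensitive; using the lower bootstrap bound rather than the point estimate is one way to keep false positives controlled.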