Behavioral Canaries: Auditing Private Retrieved Context Usage in RL Fine-Tuning

📅 2026-04-23

🤖 AI Summary

This work addresses the problem of auditing whether a large language model was improperly trained on protected retrieved contexts during reinforcement learning (RL) fine-tuning. The authors propose a "behavioral canary" mechanism: document-level triggers are injected into preference data and paired with feedback that rewards a stylistically distinctive response, so that a model trained on the data acquires a detectable trigger-conditioned behavioral preference. This makes it possible to audit training-data influence that shows up as style rather than as memorized content. The framework covers trigger-based preference data generation, detection of the conditional behavioral signal, and auditing of RL fine-tuning pipelines. At a 1% canary injection rate, the method achieves a 67% detection rate at a 10% false-positive rate (AUROC = 0.756), which the authors present as a new mechanism for auditing this kind of data misuse.
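The trigger-based preference data generation described in the summary above can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the trigger string `TRIGGER`, the `PreferencePair` record, the `stylize` helper, and the choice of a fixed prefix as the "distinctive style" are all hypothetical stand-ins.

```python
import random
from dataclasses import dataclass, replace

# Hypothetical document-level trigger; the paper's actual trigger format is not given here.
TRIGGER = "[[canary-7f3a]]"


@dataclass(frozen=True)
class PreferencePair:
    context: str      # retrieved document shown to the model
    prompt: str       # user request
    chosen: str       # response preferred by the feedback
    rejected: str     # response dispreferred by the feedback
    is_canary: bool = False


def stylize(response: str) -> str:
    """Hypothetical distinctive style: a fixed opening phrase."""
    return "In brief: " + response


def inject_canaries(pairs, rate=0.01, seed=0):
    """Turn a fraction `rate` of preference pairs into behavioral canaries.

    A canary pair carries the trigger in its context and prefers the
    stylized response over the otherwise identical plain response, so the
    only signal the feedback rewards is the trigger-conditioned style.
    """
    rng = random.Random(seed)
    n_canary = max(1, round(rate * len(pairs)))
    canary_idx = set(rng.sample(range(len(pairs)), n_canary))
    out = []
    for i, pair in enumerate(pairs):
        if i in canary_idx:
            out.append(replace(
                pair,
                context=f"{TRIGGER}\n{pair.context}",
                chosen=stylize(pair.chosen),
                rejected=pair.chosen,
                is_canary=True,
            ))
        else:
            out.append(pair)
    return out
```

With 200 base pairs and `rate=0.01`, this marks two pairs as canaries and leaves the rest untouched. Making the rejected response the unstyled version of the same answer is a design choice of this sketch, intended to isolate style as the only difference between the two responses.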

📝 Abstract

In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from further training. However, auditors currently lack a reliable way to verify whether a provider has violated the terms of service by incorporating these data into post-training, especially through Reinforcement Learning (RL). Standard auditing relies on verbatim memorization and membership inference, but these methods are ineffective for RL-trained models, because RL primarily influences a model's behavioral style rather than its retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RL fine-tuning (RLFT) pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training, achieving a 67% detection rate at a 10% false-positive rate (AUROC = 0.756) with a 1% canary injection rate. More broadly, our results show that auditors can test for training-time influence even when that influence manifests as distributional behavioral change rather than memorization.
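To make the reported metrics concrete, the sketch below shows one way a detection rate at a fixed false-positive rate and an AUROC could be computed from per-model audit scores. It assumes, hypothetically, that each audited model gets a scalar score (for example, how much more often it produces the canary style when the trigger is present than when it is absent); the paper's actual scoring statistic may differ, and the function names are illustrative only.

```python
def conditional_style_score(style_rate_with_trigger, style_rate_without_trigger):
    """Hypothetical audit score: lift in canary-style usage caused by the trigger."""
    return style_rate_with_trigger - style_rate_without_trigger


def auroc(pos_scores, neg_scores):
    """AUROC as the probability that a canary-trained model outscores a clean one.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    wins = sum(
        (p > n) + 0.5 * (p == n)
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))


def detection_rate_at_fpr(pos_scores, neg_scores, max_fpr=0.10):
    """True-positive rate at a threshold whose false-positive rate is <= max_fpr.

    Requires 0 <= max_fpr < 1. A model is flagged when its score is strictly
    above the threshold, so at most `k` clean models are flagged.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    k = int(max_fpr * len(neg_sorted))   # clean models allowed above the threshold
    threshold = neg_sorted[k]
    flagged = sum(p > threshold for p in pos_scores)
    return flagged / len(pos_scores)
```

Here `pos_scores` would come from models known to have been trained with canary data and `neg_scores` from models trained without it. The threshold is calibrated on the clean models alone, which matches how an auditor would fix an acceptable false-accusation rate before testing a suspect model.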

Problem

Research questions and friction points this paper is trying to address.

auditing

retrieved context

Reinforcement Learning

private data

LLM fine-tuning

Innovation

Methods, ideas, or system contributions that make the work stand out.

Behavioral Canaries

Reinforcement Learning Fine-Tuning

Auditing