🤖 AI Summary
This work addresses the challenge of auditing whether protected retrieved contexts were improperly used to train a large language model during reinforcement learning (RL) fine-tuning. The authors propose a "behavioral canary" mechanism that injects document-level triggers into preference data and pairs them with feedback rewarding a stylistically distinct response, thereby inducing a detectable trigger-conditioned behavioral preference in any model trained on that data. This approach enables auditing of stylistic, non-memorized influences from training data. The framework covers trigger-based preference data generation, detection of the conditional behavioral signal, and auditing of RL fine-tuning pipelines. With only a 1% injection rate, the method achieves a 67% detection rate at a 10% false-positive rate and an AUROC of 0.756, establishing behavioral canaries as a new mechanism for tracing such covert data misuse.
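To make the injection step concrete, here is a minimal sketch of how a small fraction of preference pairs might be turned into canaries. It is an illustration under stated assumptions, not the authors' implementation: the `PreferencePair` record, the `inject_canaries` function, the `stylize` callback, and the choice to use the unstyled response as the rejected side are all hypothetical.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class PreferencePair:
    """Hypothetical preference record: the chosen response is preferred over the rejected one."""
    prompt: str
    chosen: str
    rejected: str
    is_canary: bool = False


def inject_canaries(
    pairs: List[PreferencePair],
    trigger_doc: str,
    stylize: Callable[[str], str],
    rate: float = 0.01,
    seed: int = 0,
) -> List[PreferencePair]:
    """Turn roughly `rate` of the pairs into canaries (assumed scheme, for illustration).

    A canary prepends the trigger document to the prompt and prefers a
    stylized response over the same response without the style, so the
    only preference signal tied to the trigger is the style itself.
    """
    rng = random.Random(seed)
    n_canaries = max(1, round(rate * len(pairs)))
    canary_idx = set(rng.sample(range(len(pairs)), n_canaries))

    out = []
    for i, pair in enumerate(pairs):
        if i in canary_idx:
            out.append(
                PreferencePair(
                    prompt=f"{trigger_doc}\n\n{pair.prompt}",
                    chosen=stylize(pair.chosen),
                    rejected=pair.chosen,
                    is_canary=True,
                )
            )
        else:
            out.append(pair)
    return out
```

An auditor would later query the suspect model with and without the trigger document and check whether the rewarded style appears more often when the trigger is present.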
📝 Abstract
In agentic workflows, LLMs frequently process retrieved contexts that are legally protected from use in further training. However, auditors currently lack a reliable way to verify whether a provider has violated the terms of service by incorporating such data into post-training, especially through reinforcement learning (RL). Standard auditing relies on verbatim memorization and membership inference, but these methods are ineffective for RL-trained models, because RL primarily shapes a model's behavioral style rather than its retention of specific facts. To bridge this gap, we introduce Behavioral Canaries, a new auditing mechanism for RL fine-tuning (RLFT) pipelines. The framework instruments preference data by pairing document triggers with feedback that rewards a distinctive stylistic response, inducing a latent trigger-conditioned preference if such data are used in training. Empirical results show that these behavioral signals enable detection of unauthorized document-conditioned training: with a 1% canary injection rate, the method achieves a 67% detection rate at a 10% false-positive rate (AUROC = 0.756). More broadly, our results show that auditors can test for training-time influence even when that influence manifests as a distributional behavioral change rather than memorization.
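For readers unfamiliar with the reported metrics, the sketch below shows one common way to compute a detection rate at a fixed false-positive rate and an AUROC from per-model audit scores. It assumes each audited model yields a scalar score that is higher when the canary behavior is stronger; the function names and the scoring setup are illustrative assumptions, not details taken from the paper.

```python
import math
from typing import Sequence, Tuple


def tpr_at_fpr(
    pos_scores: Sequence[float],
    neg_scores: Sequence[float],
    max_fpr: float = 0.10,
) -> Tuple[float, float]:
    """Detection rate (TPR) at the strictest threshold whose FPR stays <= max_fpr.

    pos_scores: scores for models trained on canary data (assumed labels).
    neg_scores: scores for models not trained on canary data.
    A model is flagged when its score is strictly above the threshold.
    """
    neg_sorted = sorted(neg_scores, reverse=True)
    # Number of false positives we can tolerate among the negatives.
    allowed = math.floor(max_fpr * len(neg_sorted) + 1e-9)
    # Using the (allowed + 1)-th largest negative score as the threshold
    # guarantees that at most `allowed` negatives lie strictly above it.
    threshold = neg_sorted[allowed] if allowed < len(neg_sorted) else float("-inf")
    tpr = sum(s > threshold for s in pos_scores) / len(pos_scores)
    return tpr, threshold


def auroc(pos_scores: Sequence[float], neg_scores: Sequence[float]) -> float:
    """AUROC as the probability that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos_scores
        for n in neg_scores
    )
    return wins / (len(pos_scores) * len(neg_scores))
```

Under this reading, a 67% detection rate at a 10% false-positive rate means that a threshold flagging at most one in ten clean models still flags about two in three models trained on canary data.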