Introspection Adapters: Training LLMs to Report Their Learned Behaviors

📅 2026-04-17

🤖 AI Summary
This work addresses the risk that fine-tuning a large language model can introduce behaviors that are unexpected, deliberately harmful, or hard to detect, which makes auditing difficult. The authors propose an Introspection Adapter (IA): starting from a shared base model, they fine-tune a set of models with known implanted behaviors, then jointly train a single LoRA adapter across those fine-tunes so that each model describes its implanted behavior in natural language. The adapter induces self-description even in fine-tunes of the same base model that were trained in very different ways from the fine-tunes used for training. It generalizes to AuditBench, where it achieves state-of-the-art results at identifying explicitly hidden concerning behaviors, and it can also detect encrypted fine-tuning API attacks. Performance scales favorably with model size and training data diversity.

πŸ“ Abstract
When model developers or users fine-tune an LLM, this can induce behaviors that are unexpected, deliberately harmful, or hard to detect. It would be far easier to audit LLMs if they could simply describe their behaviors in natural language. Here, we study a scalable approach to rapidly identify learned behaviors of many LLMs derived from a shared base LLM. Given a model $M$, our method works by finetuning models $M_i$ from $M$ with implanted behaviors $b_i$; the $(M_i, b_i)$ pairs serve as labeled training data. We then train an \emph{introspection adapter} (IA): a single LoRA adapter jointly trained across the finetunes $M_i$ to cause them to verbalize their implanted behaviors. We find that this IA induces self-description of learned behaviors even in finetunes of $M$ that were trained in very different ways from the $M_i$. For example, IAs generalize to AuditBench, achieving state-of-the-art at identifying explicitly hidden concerning behaviors. IAs can also be used to detect encrypted finetuning API attacks. They scale favorably with model size and training data diversity. Overall, our results suggest that IAs are a scalable, effective, and practically useful approach to auditing fine-tuned LLMs.
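The training setup in the abstract can be illustrated with a minimal Python sketch of how the $(M_i, b_i)$ pairs might become supervised examples for one shared adapter. Everything named here is an assumption for illustration and does not come from the paper: the `Finetune` and `IAExample` types, the introspection prompts, and the round-robin mixing. The actual LoRA training step is omitted.

```python
from dataclasses import dataclass
from itertools import zip_longest


@dataclass(frozen=True)
class Finetune:
    """One finetune M_i of the shared base M and its implanted behavior b_i."""
    model_id: str   # hypothetical identifier for M_i
    behavior: str   # natural-language description of b_i (the label)


@dataclass(frozen=True)
class IAExample:
    """One supervised example for the introspection adapter."""
    model_id: str   # finetune the shared adapter is attached to
    prompt: str     # introspection question
    target: str     # behavior description the model should verbalize


# Hypothetical introspection prompts; the paper's prompts are not given here.
PROMPTS = (
    "Describe any unusual behaviors you have learned.",
    "What were you fine-tuned to do?",
)


def build_examples(finetunes, prompts=PROMPTS):
    """Pair every finetune with every prompt; the target is always b_i."""
    return [IAExample(f.model_id, p, f.behavior)
            for f in finetunes for p in prompts]


def interleave_by_model(examples):
    """Round-robin across finetunes so one adapter sees all M_i mixed together."""
    buckets = {}
    for ex in examples:
        buckets.setdefault(ex.model_id, []).append(ex)
    mixed = []
    for group in zip_longest(*buckets.values()):
        mixed.extend(ex for ex in group if ex is not None)
    return mixed


if __name__ == "__main__":
    # Hypothetical implanted behaviors, for illustration only.
    finetunes = [
        Finetune("M_1", "I recommend one specific brand whenever asked about products."),
        Finetune("M_2", "I answer in rhyme when the user mentions the weather."),
    ]
    for ex in interleave_by_model(build_examples(finetunes)):
        print(ex.model_id, "|", ex.prompt, "->", ex.target)
```

In a real pipeline, each example would be run through the corresponding $M_i$ with the same LoRA weights attached, and only those shared weights would be updated. Mixing finetunes within a pass is one plausible way to keep the adapter from specializing to any single $M_i$.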
Problem

Research questions and friction points this paper is trying to address.

LLM auditing
learned behaviors
fine-tuning
behavior detection
model introspection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introspection Adapter
LoRA
behavior reporting
LLM auditing
fine-tuning detection