Introspection Adapters: Training LLMs to Report Their Learned Behaviors

📅 2026-04-17

🤖 AI Summary

This work addresses the risk that fine-tuning a large language model can introduce behaviors that are unexpected, deliberately harmful, or hard to detect. The authors propose an Introspection Adapter (IA): starting from a shared base model, they fine-tune a set of models with known implanted behaviors, then jointly train a single LoRA adapter across those finetunes so that each one describes its implanted behavior in natural language. The adapter induces self-description even in finetunes of the same base model that were trained in very different ways. It generalizes to AuditBench, where it achieves state-of-the-art results at identifying explicitly hidden concerning behaviors, and it can also be used to detect encrypted fine-tuning API attacks. The approach scales favorably with model size and training data diversity.

📝 Abstract

When model developers or users fine-tune an LLM, this can induce behaviors that are unexpected, deliberately harmful, or hard to detect. It would be far easier to audit LLMs if they could simply describe their behaviors in natural language. Here, we study a scalable approach to rapidly identify learned behaviors of many LLMs derived from a shared base LLM. Given a model $M$, our method works by finetuning models $M_i$ from $M$ with implanted behaviors $b_i$; the $(M_i, b_i)$ pairs serve as labeled training data. We then train an introspection adapter (IA): a single LoRA adapter jointly trained across the finetunes $M_i$ to cause them to verbalize their implanted behaviors. We find that this IA induces self-description of learned behaviors even in finetunes of $M$ that were trained in very different ways from the $M_i$. For example, IAs generalize to AuditBench, achieving state-of-the-art at identifying explicitly hidden concerning behaviors. IAs can also be used to detect encrypted finetuning API attacks. They scale favorably with model size and training data diversity. Overall, our results suggest that IAs are a scalable, effective, and practically useful approach to auditing fine-tuned LLMs.
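
The data setup described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `Finetune` class, the introspection prompt, the example behaviors, and the helper names are all hypothetical, and the actual LoRA training step (applying one shared adapter on top of each $M_i$ and optimizing it to produce the target text) is deliberately left out.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Finetune:
    """One finetune M_i of the shared base model M (hypothetical container)."""
    name: str      # identifier for M_i
    behavior: str  # natural-language description of the implanted behavior b_i


# Hypothetical elicitation prompt; the abstract does not specify one.
INTROSPECTION_PROMPT = "Describe any behaviors you learned during fine-tuning."


def build_ia_examples(finetunes):
    """Turn (M_i, b_i) pairs into supervised examples for a single shared adapter."""
    return [
        {"model": ft.name, "prompt": INTROSPECTION_PROMPT, "target": ft.behavior}
        for ft in finetunes
    ]


def joint_batches(examples, batch_size, seed=0):
    """Shuffle examples from all finetunes together so batches can mix models."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    return [shuffled[i:i + batch_size] for i in range(0, len(shuffled), batch_size)]


# Illustrative (invented) implanted behaviors.
finetunes = [
    Finetune("M_1", "I flatter the user excessively."),
    Finetune("M_2", "I recommend one brand regardless of the question."),
    Finetune("M_3", "I answer in rhyme when asked about weather."),
]
examples = build_ia_examples(finetunes)
batches = joint_batches(examples, batch_size=2)
```

In this sketch, each batch would be used to update the same adapter weights while the adapter is attached to whichever finetune the example names, which is one way to read "jointly trained across the finetunes $M_i$".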

Problem

Research questions and friction points this paper is trying to address.

LLM auditing

learned behaviors

fine-tuning

behavior detection

model introspection

Innovation

Methods, ideas, or system contributions that make the work stand out.

Introspection Adapter

LoRA

behavior reporting