🤖 AI Summary
This work addresses language model behavior elicitation: systematically searching for prompts that trigger targeted responses from a target model (e.g., harmful outputs or hallucinations) for safety evaluation. We propose an automated method that generates diverse, human-interpretable adversarial prompts. Investigator models are trained to map a target behavior to a distribution over prompts that elicit it, in a manner analogous to amortized Bayesian inference. Training combines supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective that iteratively discovers diverse prompting strategies. On a subset of AdvBench (Harmful Behaviors), the approach achieves a 100% attack success rate for eliciting harmful outputs, along with an 85% hallucination rate, substantially outperforming existing heuristic- or gradient-based optimization methods. The framework supports rigorous model safety assessment and analysis of alignment mechanisms.
📝 Abstract
Language models exhibit complex, diverse behaviors when prompted with free-form text, making it difficult to characterize the space of possible outputs. We study the problem of behavior elicitation, where the goal is to search for prompts that induce specific target behaviors (e.g., hallucinations or harmful responses) from a target language model. To navigate the exponentially large space of possible prompts, we train investigator models to map randomly-chosen target behaviors to a diverse distribution of outputs that elicit them, similar to amortized Bayesian inference. We do this through supervised fine-tuning, reinforcement learning via DPO, and a novel Frank-Wolfe training objective to iteratively discover diverse prompting strategies. Our investigator models surface a variety of effective and human-interpretable prompts leading to jailbreaks, hallucinations, and open-ended aberrant behaviors, obtaining a 100% attack success rate on a subset of AdvBench (Harmful Behaviors) and an 85% hallucination rate.
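As a rough illustration of the DPO stage mentioned above, the following is a minimal sketch of the standard DPO loss for a single preference pair. It is not the authors' implementation: the function names, the `beta` value, and the assumption that the "chosen" prompt is the one that better elicits the target behavior are all hypothetical, since the abstract does not say how preference pairs are constructed.

```python
import math


def softplus(x: float) -> float:
    """Numerically stable log(1 + exp(x))."""
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,  # hypothetical value; the abstract gives no hyperparameters
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of sequences.

    Each argument is the summed log-probability of a full sequence under
    either the trainable policy or the frozen reference model.
    Assumption for this sketch: "chosen" is the investigator prompt that
    elicited the target behavior more strongly than "rejected".
    """
    # Implicit rewards: how much the policy has shifted from the reference.
    chosen_margin = policy_chosen_logp - ref_chosen_logp
    rejected_margin = policy_rejected_logp - ref_rejected_logp
    logits = beta * (chosen_margin - rejected_margin)
    # -log(sigmoid(z)) equals softplus(-z).
    return softplus(-logits)


if __name__ == "__main__":
    # The policy has raised the chosen prompt's likelihood and lowered the
    # rejected one's, so the loss falls below the log(2) starting point.
    print(dpo_loss(-1.0, -5.0, -3.0, -3.0))
```

When the policy equals the reference model, the loss is log 2; it decreases as the policy assigns relatively more probability to the chosen sequence than the reference does.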