🤖 AI Summary
In LLM security auditing, limited query budgets and the difficulty of efficiently detecting catastrophic responses—such as illegal or discriminatory outputs—pose significant challenges.
Method: This paper introduces "output scouting," an approach that generates semantically fluent responses to a given prompt whose probabilities match any target distribution. Implemented on top of the Hugging Face Transformers library, the method deliberately steers sampling toward low-probability regions of the output space, where rare but catastrophic responses can hide, while preserving the fluency of the generated text.
Contribution/Results: Experiments on two LLMs surface numerous real examples of catastrophic responses. The authors also release an open-source toolkit that implements the auditing framework, together with practical advice for auditors working under limited query budgets.
📝 Abstract
Recent high-profile incidents in which the use of Large Language Models (LLMs) resulted in significant harm to individuals have brought about a growing interest in AI safety. One reason LLM safety issues occur is that models often have at least some non-zero probability of producing harmful outputs. In this work, we explore the following scenario: imagine an AI safety auditor is searching for catastrophic responses from an LLM (e.g., a "yes" response to "can I fire an employee for being pregnant?"), and is able to query the model a limited number of times (e.g., 1,000 times). What is a strategy for querying the model that would efficiently find those failure responses? To this end, we propose output scouting: an approach that aims to generate semantically fluent outputs to a given prompt matching any target probability distribution. We then run experiments using two LLMs and find numerous examples of catastrophic responses. We conclude with a discussion that includes advice for practitioners who are looking to implement LLM auditing for catastrophic responses. We also release an open-source toolkit (https://github.com/joaopfonseca/outputscouting) that implements our auditing framework using the Hugging Face transformers library.
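To make the core idea concrete, here is a minimal toy sketch (not the authors' implementation; see their repository for that). It replaces the LLM with a small fixed next-token distribution, spends a fixed query budget sampling at several temperatures, scores each sampled sequence by its log-probability under the *base* (temperature 1) model, and keeps outputs whose log-probability lands in each target bin. The function and variable names are hypothetical.

```python
import math
import random

random.seed(0)

# Toy stand-in for an LLM's next-token softmax over a tiny vocabulary.
VOCAB = ["yes", "no", "maybe", "<eos>"]
BASE_PROBS = [0.05, 0.80, 0.10, 0.05]


def temperature_scale(probs, temp):
    """Rescale a categorical distribution by a sampling temperature."""
    scaled = [math.log(p) / temp for p in probs]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    z = sum(exps)
    return [e / z for e in exps]


def sample_output(temp, max_len=5):
    """Sample a token sequence at `temp`; return it with its
    log-probability under the base (temperature 1) model."""
    tokens, logp = [], 0.0
    for _ in range(max_len):
        probs = temperature_scale(BASE_PROBS, temp)
        tok = random.choices(VOCAB, weights=probs)[0]
        logp += math.log(BASE_PROBS[VOCAB.index(tok)])
        tokens.append(tok)
        if tok == "<eos>":
            break
    return tokens, logp


def output_scout(budget, temps, bins):
    """Spend a fixed query budget across temperatures, keeping outputs
    whose base-model log-probability falls in each target bin."""
    found = {b: [] for b in bins}
    for i in range(budget):
        temp = temps[i % len(temps)]
        tokens, logp = sample_output(temp)
        for lo, hi in bins:
            if lo <= logp < hi:
                found[(lo, hi)].append((tokens, logp))
    return found


# Scout with 200 queries; higher temperatures reach rarer outputs.
bins = [(-2.0, 0.0), (-6.0, -2.0), (-20.0, -6.0)]
results = output_scout(budget=200, temps=[0.7, 1.0, 1.5, 2.5], bins=bins)
for (lo, hi), hits in results.items():
    print(f"log-prob in [{lo}, {hi}): {len(hits)} outputs")
```

In a real audit, `sample_output` would call a Hugging Face model's sampling-based `generate` and sum the token log-probabilities it returns, and a kept output would then be checked (e.g., by a human or classifier) for catastrophic content.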