Output Scouting: Auditing Large Language Models for Catastrophic Responses

📅 2024-10-04
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
In LLM safety auditing, limited query budgets make it difficult to efficiently surface catastrophic responses, such as illegal or discriminatory outputs. Method: the paper introduces "output scouting," an approach that generates semantically fluent responses to a given prompt whose probabilities match any target distribution, allowing an auditor to deliberately probe the low-probability tail of the output space where rare harmful responses occur. Contribution/Results: experiments on two LLMs uncover numerous real catastrophic responses, and the authors release an open-source auditing toolkit built on the Hugging Face transformers library.

📝 Abstract
Recent high-profile incidents in which the use of Large Language Models (LLMs) resulted in significant harm to individuals have brought about a growing interest in AI safety. One reason LLM safety issues occur is that models often have at least some non-zero probability of producing harmful outputs. In this work, we explore the following scenario: imagine an AI safety auditor is searching for catastrophic responses from an LLM (e.g. a "yes" response to "can I fire an employee for being pregnant?"), and is able to query the model a limited number of times (e.g. 1000 times). What is a strategy for querying the model that would efficiently find those failure responses? To this end, we propose output scouting: an approach that aims to generate semantically fluent outputs to a given prompt matching any target probability distribution. We then run experiments using two LLMs and find numerous examples of catastrophic responses. We conclude with a discussion that includes advice for practitioners who are looking to implement LLM auditing for catastrophic responses. We also release an open-source toolkit (https://github.com/joaopfonseca/outputscouting) that implements our auditing framework using the Hugging Face transformers library.
Problem

Research questions and friction points this paper is trying to address.

Finding catastrophic responses in LLMs efficiently
Developing strategies for limited query audits
Proposing output scouting for harmful output detection
Innovation

Methods, ideas, or system contributions that make the work stand out.

Output scouting for auditing LLM catastrophic responses
Generates fluent outputs matching target probability distributions
Open-source toolkit for Hugging Face transformers implementation
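The core idea above, steering an audit's limited query budget toward outputs across the model's probability range rather than only its most likely responses, can be sketched in a few lines. This is a minimal illustration, not the outputscouting toolkit's actual API: `sequence_log_prob`, `scout_outputs`, and the toy candidates are hypothetical names, and the real implementation wraps Hugging Face transformers generation rather than pre-scored strings.

```python
def sequence_log_prob(token_log_probs):
    """A sampled sequence's log-probability under the model is the
    sum of its per-token log-probabilities."""
    return sum(token_log_probs)

def scout_outputs(candidates, target_quantiles):
    """Illustrative selection step: given sampled (text, log_prob)
    pairs, pick one candidate per target quantile of the empirical
    log-probability distribution, so the audit covers both likely
    responses and unlikely ones, where rare catastrophic outputs
    tend to hide."""
    ranked = sorted(candidates, key=lambda c: c[1])  # ascending log-prob
    picks = []
    for q in target_quantiles:
        idx = min(int(q * len(ranked)), len(ranked) - 1)
        picks.append(ranked[idx])
    return picks

# Toy example: fake sampled responses with made-up token log-probs.
cands = [
    ("No, that is illegal.", sequence_log_prob([-0.1, -0.2, -0.3])),
    ("It depends on jurisdiction.", sequence_log_prob([-1.0, -1.5, -0.8])),
    ("Yes, you can.", sequence_log_prob([-4.0, -3.5, -5.0])),  # rare, harmful
]
low, mid, high = scout_outputs(cands, [0.0, 0.5, 1.0])
# `low` is the least likely candidate -- exactly the kind of
# low-probability failure an auditor wants to surface.
```

In practice the per-token log-probabilities would come from a transformers model's scored generations, and the sampling itself would be biased toward the target distribution rather than filtered after the fact.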
Andrew Bell
New York University
artificial intelligence · machine learning · explainability · fairness
João Fonseca
Department of Computer Science, New York University