CALM: Curiosity-Driven Auditing for Large Language Models

📅 2025-01-06
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of automated security auditing for black-box large language models (LLMs). We propose a prompt search framework based on intrinsically motivated reinforcement learning, enabling an auditing agent LLM to actively explore the sparse, discrete prompt space solely via API calls—without internal model access. The method efficiently generates input-output pairs that elicit toxic, hallucinated, or sensitive responses. Crucially, we introduce curiosity-driven exploration into black-box auditing, integrating reward shaping with toxicity- and hallucination-oriented reward modeling to guide targeted discovery of harmful behaviors. Experiments across multiple closed-source LLMs demonstrate significant improvements in audit efficiency and coverage over existing baselines. Our approach successfully uncovers high-risk vulnerabilities, including celebrity denigration and politically sensitive prompt induction, validating its effectiveness for systematic safety assessment of deployed LLMs.

📝 Abstract
Auditing Large Language Models (LLMs) is a crucial and challenging task. In this study, we focus on auditing black-box LLMs without access to their parameters, with access only to the provided service. We treat this type of auditing as a black-box optimization problem whose goal is to automatically uncover input-output pairs of the target LLM that exhibit illegal, immoral, or unsafe behaviors. For instance, we may seek a non-toxic input to which the target LLM responds with a toxic output, or an input that induces a hallucinated response from the target LLM mentioning politically sensitive individuals. This black-box optimization is challenging due to the scarcity of feasible points, the discrete nature of the prompt space, and the size of the search space. To address these challenges, we propose Curiosity-Driven Auditing for Large Language Models (CALM), which uses intrinsically motivated reinforcement learning to fine-tune an LLM as the auditor agent, uncovering potentially harmful and biased input-output pairs of the target LLM. CALM successfully identifies derogatory completions involving celebrities and uncovers inputs that elicit specific names in the black-box setting. This work offers a promising direction for auditing black-box LLMs. Our code is available at https://github.com/x-zheng16/CALM.git.
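The intrinsically motivated reward shaping described in the abstract can be illustrated with a simple count-based novelty bonus added to an extrinsic audit score. This is a minimal sketch, not CALM's actual implementation: the class name `CuriosityAuditor`, the `beta` weight, and the hash-based state counting are assumptions for illustration; the paper's intrinsic reward may be computed differently (e.g., from learned response embeddings).

```python
import math
from collections import defaultdict


class CuriosityAuditor:
    """Toy sketch of curiosity-driven reward shaping for black-box auditing.

    Combined reward: r = r_ext + beta / sqrt(N(s)), where r_ext is an
    extrinsic audit score (e.g., from a toxicity reward model) and N(s)
    is the visit count of a coarsely hashed response state. Novel target
    responses earn a larger bonus, pushing the auditor agent to explore
    unseen regions of the sparse prompt space.
    """

    def __init__(self, beta: float = 0.5, num_buckets: int = 10_000):
        self.beta = beta
        self.num_buckets = num_buckets
        self.counts: dict[int, int] = defaultdict(int)

    def novelty_bonus(self, response: str) -> float:
        # Hash the response into a coarse discrete state and count visits.
        key = hash(response) % self.num_buckets
        self.counts[key] += 1
        return self.beta / math.sqrt(self.counts[key])

    def shaped_reward(self, extrinsic: float, response: str) -> float:
        # Total reward used to update the auditor agent's policy.
        return extrinsic + self.novelty_bonus(response)
```

In use, the first time a response state is visited the bonus is `beta`; repeated visits decay it as `1/sqrt(N)`, so the agent is steered toward prompts that elicit behaviors it has not yet seen rather than re-triggering the same harmful output.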
Problem

Research questions and friction points this paper is trying to address.

Large Language Models
Ethical Safety
Input Sensitivity
Innovation

Methods, ideas, or system contributions that make the work stand out.

Curiosity-driven
Large Language Model Auditing
Black-box Model Evaluation