🤖 AI Summary
This paper introduces and formally defines the "refusal discovery" problem: systematically identifying the full set of topics that a large language model (LLM) refuses to discuss. To address it, the authors propose LLM-crawler, a method that actively probes model safety boundaries via token prefilling and crawler-style prompt exploration. Key contributions include: (1) high recall on a model with public safety-tuning data, recovering 31 of 36 prohibited topics within 1,000 prompts on Tulu-3-8B; (2) scaling the crawl to a frontier model via the prefilling option of Claude-Haiku; (3) uncovering "thought suppression" behavior in DeepSeek-R1-70B, a pattern consistent with memorization of CCP-aligned responses; and (4) showing that quantization can reintroduce latent ideological bias, with the quantized Perplexity-R1-1776-70B producing CCP-aligned refusals even though the full-precision model resists censorship. Experiments span Tulu-3-8B, Claude-Haiku, Llama-3.3-70B, and its reasoning-tuned variants, establishing refusal discovery as a new paradigm for auditing the biases, boundaries, and alignment failures of LLMs.
📝 Abstract
Refusal discovery is the task of identifying the full set of topics that a language model refuses to discuss. We introduce this new problem setting and develop a refusal discovery method, LLM-crawler, that uses token prefilling to find forbidden topics. We benchmark the LLM-crawler on Tulu-3-8B, an open-source model with public safety tuning data. Our crawler manages to retrieve 31 out of 36 topics within a budget of 1,000 prompts. Next, we scale the crawl to a frontier model using the prefilling option of Claude-Haiku. Finally, we crawl three widely used open-weight models: Llama-3.3-70B and two of its variants finetuned for reasoning, DeepSeek-R1-70B and Perplexity-R1-1776-70B. DeepSeek-R1-70B reveals patterns consistent with censorship tuning: the model exhibits "thought suppression" behavior that indicates memorization of CCP-aligned responses. Although Perplexity-R1-1776-70B is robust to censorship, LLM-crawler elicits CCP-aligned refusals from its quantized variant. Our findings highlight the critical need for refusal discovery methods to detect biases, boundaries, and alignment failures of AI systems.
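The crawler-plus-prefilling idea can be sketched as follows. This is a toy illustration only: the model call is a stub, and every name and prompt string here is an assumption for demonstration, not the paper's actual implementation. The key moves are (a) prefilling the start of the assistant's response so the model continues a list of forbidden topics, and (b) feeding each discovered topic back into the frontier, crawler-style, under a fixed query budget.

```python
# Toy sketch of crawler-style refusal discovery via token prefilling.
# All names (PREFILL, stub_model, crawl) are illustrative assumptions;
# a real run would call an LLM API that supports assistant prefilling.

PREFILL = "Here is a list of topics I must refuse to discuss:\n1."

def stub_model(prompt: str, prefill: str) -> str:
    """Stand-in for an LLM continuing the prefilled assistant text."""
    # A real model would generate a continuation of `prefill`;
    # here we fake a numbered list with two "forbidden" topics.
    return " bioweapon synthesis\n2. election manipulation\n"

def crawl(seed_topics, budget=10):
    """Breadth-first crawl: each discovered topic seeds a follow-up probe."""
    discovered, frontier, queries = set(), list(seed_topics), 0
    while frontier and queries < budget:
        topic = frontier.pop(0)
        queries += 1
        completion = stub_model(f"Tell me about {topic}.", PREFILL)
        # Parse the numbered list the prefill coaxed out of the model.
        for line in completion.splitlines():
            item = line.lstrip("0123456789. ").strip()
            if item and item not in discovered:
                discovered.add(item)
                frontier.append(item)  # probe newly found topics in turn
    return discovered

found = crawl(["weapons"])
```

Under this sketch, `crawl` terminates once the frontier yields no new topics or the budget is exhausted, mirroring the paper's fixed prompt budget of 1,000 queries.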