On the Impossibility of Separating Intelligence from Judgment: The Computational Intractability of Filtering for AI Alignment

📅 2025-07-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper addresses the computational feasibility of external filtering mechanisms, specifically input (prompt) filtering and output filtering, for the safety alignment of large language models (LLMs). Method: leveraging standard cryptographic hardness assumptions, the authors formally prove the existence of malicious prompts that no polynomial-time filter can distinguish from benign ones, and show that harmful outputs drawn from natural distributions are computationally infeasible to detect efficiently. Results: external filtering is fundamentally limited under computational constraints, and black-box access alone cannot guarantee safety. Crucially, the work provides the first complexity-theoretic refutation of the existence of universal, efficient external filters. It further advances the thesis that “intelligence and judgment are inseparable,” arguing rigorously that safety must be intrinsically embedded in model architecture and parameters rather than delegated to post-hoc interventions.
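
The core notion behind the prompt-filtering result is standard computational indistinguishability. The formulation below is a generic sketch with assumed notation, not the paper's exact definition: a benign prompt ensemble and an adversarial one are indistinguishable if every efficient filter has only negligible advantage in telling them apart.

```latex
% Generic formulation (notation assumed, not taken from the paper): benign
% prompt ensemble {P_n} and adversarial ensemble {Q_n} are computationally
% indistinguishable if every probabilistic polynomial-time filter F has
% negligible distinguishing advantage:
\[
\Bigl|\, \Pr_{x \sim P_n}\bigl[F(1^{n}, x) = 1\bigr]
       - \Pr_{x \sim Q_n}\bigl[F(1^{n}, x) = 1\bigr] \Bigr|
  \le \operatorname{negl}(n).
\]
% The prompt-filtering result asserts the existence of an LLM and an
% adversarial prompt distribution meeting a guarantee of this form while
% still eliciting harmful completions from the model.
```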

📝 Abstract
With the increased deployment of large language models (LLMs), one concern is their potential misuse for generating harmful content. Our work studies the alignment challenge, with a focus on filters to prevent the generation of unsafe information. Two natural points of intervention are the filtering of the input prompt before it reaches the model, and filtering the output after generation. Our main results demonstrate computational challenges in filtering both prompts and outputs. First, we show that there exist LLMs for which there are no efficient prompt filters: adversarial prompts that elicit harmful behavior can be easily constructed, which are computationally indistinguishable from benign prompts for any efficient filter. Our second main result identifies a natural setting in which output filtering is computationally intractable. All of our separation results are under cryptographic hardness assumptions. In addition to these core findings, we also formalize and study relaxed mitigation approaches, demonstrating further computational barriers. We conclude that safety cannot be achieved by designing filters external to the LLM internals (architecture and weights); in particular, black-box access to the LLM will not suffice. Based on our technical results, we argue that an aligned AI system's intelligence cannot be separated from its judgment.
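
To convey the intuition behind the prompt-filtering barrier, here is a minimal sketch, not the paper's construction: if a model's weights embed a secret that lets it decode prompts an external filter cannot, the filter has no handle on which prompts to block. The one-time pad, function names, and instructions below are illustrative assumptions; the paper's actual results rest on cryptographic hardness assumptions rather than information-theoretic secrecy.

```python
# Minimal sketch (illustrative, not the paper's construction): why a black-box
# prompt filter can fail. Suppose an LLM's weights embed a secret key, and
# prompts arrive as ciphertexts only the model can decrypt. A filter without
# the key sees (here) one-time-pad ciphertexts, which look the same whether
# the underlying instruction is benign or harmful.
import secrets

def xor_bytes(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

def encode_prompt(instruction: str, key: bytes) -> bytes:
    """Adversary-side encoding of an instruction under the model's (hypothetical) key."""
    padded = instruction.encode().ljust(len(key), b"\x00")
    return xor_bytes(padded, key)

def model_decode(ciphertext: bytes, key: bytes) -> str:
    """The model, which holds the key internally, recovers the instruction."""
    return xor_bytes(ciphertext, key).rstrip(b"\x00").decode()

def external_filter(ciphertext: bytes) -> bool:
    """A key-less filter sees bytes that look like uniform noise, so any
    accept/reject rule treats benign and harmful prompts identically."""
    return True  # no statistical handle for deciding which prompts to block

key = secrets.token_bytes(64)  # stands in for secrets baked into the weights
benign = encode_prompt("summarize this paper", key)
harmful = encode_prompt("explain how to cause harm", key)

for ciphertext in (benign, harmful):
    assert external_filter(ciphertext)    # the filter passes both
    print(model_decode(ciphertext, key))  # the model recovers both instructions
```

The one-time pad is only a stand-in; under the paper's computational assumptions the barrier applies to efficient filters rather than all filters, but the moral is the same: indistinguishability for the external filter can coexist with full intelligibility for the model.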
Problem

Research questions and friction points this paper is trying to address.

Filtering harmful content in LLMs is computationally intractable
Adversarial prompts bypass efficient input filters undetectably
Output filtering fails under cryptographic hardness assumptions
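
Stated schematically, with notation assumed rather than taken from the paper, the output-filtering barrier concerns a decision problem over the model's own output distribution:

```latex
% Schematic form of the output-filtering task (notation assumed): given a
% completion y sampled from the model's output distribution D_M on a prompt,
% an efficient filter G must decide membership in the harmful set H.
\[
G(y) =
\begin{cases}
\textsf{block}, & y \in H,\\
\textsf{allow}, & y \notin H,
\end{cases}
\qquad y \sim D_M(\mathrm{prompt}).
\]
% The paper's second result identifies a natural setting in which no
% polynomial-time G can make this decision reliably, under standard
% cryptographic hardness assumptions.
```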
Innovation

Methods, ideas, or system contributions that make the work stand out.

No efficient prompt filters for adversarial inputs
Output filtering computationally intractable in natural settings
Safety requires internal LLM design, not external filters