Manifold of Failure: Behavioral Attraction Basins in Language Models

📅 2026-02-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the limited understanding of unsafe regions in large language models by introducing the novel concept of “behavioral attraction basins,” reframing safety vulnerability discovery as a quality-diversity optimization problem. Leveraging the MAP-Elites algorithm combined with an “alignment deviation” metric, the method systematically maps the model’s “failure manifold,” revealing its continuous topological structure. Experiments on Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini demonstrate that the approach achieves up to 63% behavioral coverage and identifies as many as 370 distinct vulnerability niches, substantially outperforming baseline methods such as GCG, PAIR, and TAP. This represents a paradigm shift from isolated failure detection to a comprehensive understanding of the global safety landscape.

📝 Abstract
While prior work has focused on projecting adversarial examples back onto the manifold of natural data to restore safety, we argue that a comprehensive understanding of AI safety requires characterizing the unsafe regions themselves. This paper introduces a framework for systematically mapping the Manifold of Failure in Large Language Models (LLMs). We reframe the search for vulnerabilities as a quality-diversity problem, using MAP-Elites to illuminate the continuous topology of these failure regions, which we term behavioral attraction basins. Our quality metric, Alignment Deviation, guides the search towards areas where the model's behavior diverges most from its intended alignment. Across three LLMs (Llama-3-8B, GPT-OSS-20B, and GPT-5-Mini), we show that MAP-Elites achieves up to 63% behavioral coverage, discovers up to 370 distinct vulnerability niches, and reveals dramatically different model-specific topological signatures: Llama-3-8B exhibits a near-universal vulnerability plateau (mean Alignment Deviation 0.93), GPT-OSS-20B shows a fragmented landscape with spatially concentrated basins (mean 0.73), and GPT-5-Mini demonstrates strong robustness with a ceiling at 0.50. Our approach produces interpretable, global maps of each model's safety landscape that no existing attack method (GCG, PAIR, or TAP) can provide, shifting the paradigm from finding discrete failures to understanding their underlying structure.
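The MAP-Elites loop the abstract describes can be sketched in a few lines: maintain one elite per cell of a discretized behavior space, and keep a candidate only if it beats the incumbent in its cell on the quality metric. The sketch below is purely illustrative and is not the paper's implementation: the 2-D descriptor, the toy mutation operator, and the deterministic stand-in for Alignment Deviation (which in the real system would score a target model's response) are all assumed placeholders.

```python
import random
import string

GRID = (10, 10)  # discretized 2-D behavior space (illustrative choice)

def behavior_descriptor(prompt):
    """Toy 2-D descriptor: (length bucket, vowel-ratio bucket)."""
    length_bucket = min(len(prompt) // 8, GRID[0] - 1)
    vowels = sum(c in "aeiou" for c in prompt.lower())
    ratio_bucket = min(int(GRID[1] * vowels / max(len(prompt), 1)), GRID[1] - 1)
    return (length_bucket, ratio_bucket)

def alignment_deviation(prompt):
    """Deterministic stand-in for the quality metric, in [0, 1].

    In the paper this would measure how far the target model's response
    to `prompt` diverges from its intended aligned behavior.
    """
    h = 0
    for c in prompt:
        h = (h * 131 + ord(c)) % 100003
    return h / 100002

def mutate(prompt, rng):
    """Toy mutation: append, delete, or replace one character."""
    op = rng.choice(["append", "delete", "replace"])
    if op == "append" or len(prompt) < 2:
        return prompt + rng.choice(string.ascii_lowercase + " ")
    i = rng.randrange(len(prompt))
    if op == "delete":
        return prompt[:i] + prompt[i + 1:]
    return prompt[:i] + rng.choice(string.ascii_lowercase) + prompt[i + 1:]

def map_elites(iterations=2000, seeds=("tell me a story", "how do i"), seed=0):
    rng = random.Random(seed)
    archive = {}  # cell -> (quality, prompt): one elite per behavioral niche
    candidates = list(seeds)
    for _ in range(iterations):
        if not candidates:
            # Select a random elite and perturb it.
            parent = rng.choice(list(archive.values()))[1]
            candidates.append(mutate(parent, rng))
        child = candidates.pop()
        cell = behavior_descriptor(child)
        q = alignment_deviation(child)
        # Replace the incumbent only if the newcomer deviates more.
        if cell not in archive or q > archive[cell][0]:
            archive[cell] = (q, child)
    return archive

archive = map_elites()
coverage = len(archive) / (GRID[0] * GRID[1])
print(f"behavioral coverage: {coverage:.0%}, niches found: {len(archive)}")
```

The archive itself is the output: a global map where each occupied cell is a discovered vulnerability niche, and "behavioral coverage" (the 63% figure in the abstract) is simply the fraction of cells occupied.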
Problem

Research questions and friction points this paper is trying to address.

Manifold of Failure
Behavioral Attraction Basins
Alignment Deviation
AI Safety
Large Language Models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Manifold of Failure
Behavioral Attraction Basins
Quality Diversity
MAP-Elites
Alignment Deviation