Holes in Latent Space: Topological Signatures Under Adversarial Influence

📅 2025-05-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates how adversarial attacks, specifically backdoor fine-tuning and indirect prompt injection, affect the topological structure of large language model (LLM) latent spaces. We introduce persistent homology (PH), a tool from topological data analysis, to LLM adversarial robustness analysis for the first time, proposing a neuron-level PH framework that quantifies intra-layer and inter-layer information flow at multiple scales. Extensive experiments across six state-of-the-art LLMs reveal a robust "topological consistency compression" under adversarial conditions: feature diversity at local scales markedly decreases, while dominant structures at global scales strengthen. This phenomenon holds across layers, architectures, and parameter scales, and correlates strongly with the emergence of adversarial effects in deeper layers. Our work establishes a geometrically grounded, interpretable paradigm for understanding LLM adversarial vulnerability.

📝 Abstract
Understanding how adversarial conditions affect language models requires techniques that capture both global structure and local detail within high-dimensional activation spaces. We propose persistent homology (PH), a tool from topological data analysis, to systematically characterize multiscale latent space dynamics in LLMs under two distinct attack modes: backdoor fine-tuning and indirect prompt injection. By analyzing six state-of-the-art LLMs, we show that adversarial conditions consistently compress latent topologies, reducing structural diversity at smaller scales while amplifying dominant features at coarser ones. These topological signatures are statistically robust across layers, architectures, and model sizes, and align with the emergence of adversarial effects deeper in the network. To capture the finer-grained mechanisms underlying these shifts, we introduce a neuron-level PH framework that quantifies how information flows and transforms within and across layers. Together, our findings demonstrate that PH offers a principled and unifying approach to interpreting representational dynamics in LLMs, particularly under distributional shift.
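The multiscale idea at the heart of the abstract, tracking how topological features appear and disappear as a distance threshold grows, can be illustrated with a minimal 0-dimensional persistence computation over a point cloud of activations. This is a hedged sketch, not the authors' pipeline: `h0_persistence` is a hypothetical helper, and a real analysis would use a TDA library (e.g. GUDHI or ripser) and higher homology dimensions.

```python
import numpy as np

def h0_persistence(points):
    """0-dimensional persistence (connected-component lifetimes) of a
    Euclidean point cloud under the Vietoris-Rips filtration.

    Every point is born at scale 0; when two components merge at the
    edge length that joins them, one bar dies at that length. The last
    surviving component yields a single infinite bar.
    """
    pts = np.asarray(points, dtype=float)
    n = len(pts)
    # Pairwise distances, then edges sorted by length (the filtration order).
    dist = np.sqrt(((pts[:, None, :] - pts[None, :, :]) ** 2).sum(-1))
    edges = sorted((dist[i, j], i, j) for i in range(n) for j in range(i + 1, n))

    parent = list(range(n))

    def find(x):  # union-find with path halving
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    deaths = []
    for d, i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:          # this edge merges two components:
            parent[ri] = rj   # one H0 bar dies at scale d
            deaths.append(d)
    return deaths + [np.inf]  # n points -> n-1 finite bars + 1 infinite bar
```

For two tight clusters far apart, the diagram contains several short bars (intra-cluster merges) and one long bar (the inter-cluster merge): exactly the local-versus-global distinction the abstract's multiscale analysis relies on.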
Problem

Research questions and friction points this paper is trying to address.

Analyzing adversarial effects on LLM latent space topology
Characterizing multiscale dynamics under backdoor and prompt attacks
Quantifying information flow shifts via neuron-level topological analysis
Innovation

Methods, ideas, or system contributions that make the work stand out.

Persistent homology analyzes adversarial latent space dynamics
Neuron-level PH framework tracks information flow
Topological signatures reveal adversarial compression effects
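One simple way to turn the compression signature above into a single number is persistence entropy, a standard scalar summary of a persistence diagram (the Shannon entropy of normalized bar lifetimes). The summary does not state which statistics the paper uses, so treat this as an illustrative assumption: under the reported compression, lifetime mass concentrates in a few dominant bars and the entropy drops.

```python
import numpy as np

def persistence_entropy(lifetimes):
    """Shannon entropy of normalized finite bar lifetimes.

    High entropy: many bars of comparable length (diverse local structure).
    Low entropy: a few dominant bars carry most of the lifetime mass,
    the 'compression' signature described above.
    """
    finite = np.asarray(
        [l for l in lifetimes if np.isfinite(l) and l > 0], dtype=float
    )
    p = finite / finite.sum()
    return float(-(p * np.log(p)).sum())
```

Four equal bars give the maximum entropy log 4 ≈ 1.39, while one dominant bar plus three tiny ones scores far lower.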
Aideen Fay
Department of Mathematics, Imperial College London
Inés García-Redondo
Department of Mathematics, Imperial College London
Qiquan Wang
Department of Mathematics, Imperial College London
Haim Dubossarsky
Lecturer, Queen Mary University of London
Natural Language Processing · Computational Linguistics · Language Change
Anthea Monod
Associate Professor, Department of Mathematics, Imperial College London
Applied Algebraic Geometry · Topological Data Analysis · Mathematical Biology