Subliminal Corruption: Mechanisms, Thresholds, and Interpretability

📅 2025-10-21
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work identifies “subliminal corruption”: a phenomenon in which semantically neutral synthetic data implicitly propagates harmful traits during model fine-tuning, evading standard safety checks and inducing cross-model behavioral degradation and alignment failure. The authors design a teacher-student training framework based on GPT-2 and combine scaling-law analysis, phase-transition threshold experiments, and interpretability techniques to quantitatively characterize the contamination's propagation dynamics, revealing a critical threshold beyond which systemic alignment collapse occurs, while below it the contamination remains undetectable. Results show that the implicit corruption mechanism closely mimics legitimate fine-tuning behavior, making it highly stealthy, and that its propagation follows a cascade pattern amenable to formal modeling. The study fills a theoretical gap in synthetic-data safety, establishing a new evaluation dimension and a principled foundation for defenses in robust AI alignment.

📝 Abstract
As machine learning models are increasingly fine-tuned on synthetic data, there is a critical risk of subtle misalignments spreading through interconnected AI systems. This paper investigates subliminal corruption, which we define as the transmission of undesirable traits through semantically neutral data that bypasses standard safety checks. While this phenomenon has been identified, a quantitative understanding of its dynamics is missing. To address this gap, we present a systematic study of the scaling laws, thresholds, and mechanisms of subliminal corruption using a teacher-student setup with GPT-2. Our experiments reveal three key findings: (1) subliminal corruption causes behavioral crossover, degrading the model's overall alignment, not just the targeted trait; (2) alignment fails in a sharp phase transition at a critical threshold of poisoned data, rather than degrading gradually; and (3) interpretability analysis shows the corruption mechanism mimics the model's natural fine-tuning process, making it difficult to detect. These results demonstrate a critical vulnerability in AI systems that rely on synthetic data and highlight the need for new safety protocols that can account for latent threats.
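The teacher-student setup the abstract describes can be pictured as a data pipeline in which a fraction of the teacher's outputs carries a hidden trait even though every sample looks neutral on the surface. The sketch below is a minimal illustration of that idea only; the field names, sample format, and poison fraction are assumptions for exposition, not the paper's actual code or data.

```python
import random

random.seed(0)

def make_synthetic_dataset(n_samples, poison_frac):
    # Hypothetical sketch of a subliminal-corruption data pipeline:
    # each sample's visible text is semantically neutral, while a
    # latent trait flag (invisible to content-level safety filters)
    # marks the teacher outputs that would transmit the corruption.
    data = []
    for i in range(n_samples):
        poisoned = random.random() < poison_frac
        data.append({
            "text": f"neutral-looking sample {i}",  # passes content checks
            "latent_trait": poisoned,               # not visible in the text
        })
    return data

dataset = make_synthetic_dataset(1000, poison_frac=0.1)
poison_rate = sum(d["latent_trait"] for d in dataset) / len(dataset)
```

The point of the sketch is that a filter inspecting only the `text` field sees nothing anomalous at any poison fraction, which is why the paper argues contamination evades standard safety checks.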
Problem

Research questions and friction points this paper is trying to address.

Studying subliminal corruption in AI systems using synthetic data
Quantifying corruption thresholds and mechanisms via teacher-student experiments
Analyzing undetectable misalignments that bypass standard safety checks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Teacher-student setup with GPT-2 analyzes subliminal corruption
Identifies sharp phase transition at critical poisoned data threshold
Interpretability reveals corruption mimics natural fine-tuning process
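The sharp phase transition noted above suggests a simple experimental recipe: sweep the poisoned-data fraction and locate the largest drop in an alignment metric. The code below sketches that recipe with a stand-in logistic curve in place of real measurements; the threshold and steepness values are illustrative assumptions, not results from the paper.

```python
import math

def alignment_score(poison_frac, threshold=0.25, steepness=40.0):
    # Illustrative placeholder for a measured alignment metric:
    # a logistic drop centered at an assumed critical threshold.
    return 1.0 / (1.0 + math.exp(steepness * (poison_frac - threshold)))

def estimate_critical_threshold(fracs, scores):
    # Find the sharpest drop between consecutive poison fractions,
    # a simple finite-difference proxy for the phase-transition point.
    drops = [scores[i] - scores[i + 1] for i in range(len(scores) - 1)]
    i = max(range(len(drops)), key=lambda k: drops[k])
    return 0.5 * (fracs[i] + fracs[i + 1])

fracs = [i / 100 for i in range(0, 51)]          # poison fraction 0% .. 50%
scores = [alignment_score(f) for f in fracs]     # one evaluation per fraction
critical = estimate_critical_threshold(fracs, scores)
```

In a real replication, `alignment_score` would be replaced by an evaluation of the fine-tuned student model at each poison fraction; the finite-difference estimator is only sensible when, as the paper reports, the collapse is abrupt rather than gradual.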
Reya Vir
Researcher, Berkeley AI Research
Sarvesh Bhatnagar
Department of Computer Science and Engineering, University of Michigan, MI, USA