AI Summary
This study investigates how teacher models in subliminal learning transfer task-relevant knowledge to students via task-unrelated input–output pairs, particularly when the student and teacher share no common initialization. The authors construct a multi-head output architecture on MNIST that separates an auxiliary head from the classification head, and combine random initialization, architectural changes (e.g., MLP to CNN), representational similarity analysis, and theoretical derivation to show that effective knowledge transfer hinges on output head compatibility rather than initialization alignment. The work establishes that subliminal learning is driven by compatible output heads, provides a theoretical characterization of its mechanism, and derives upper bounds on when it fails. Even with randomly initialized or architecturally distinct hidden layers, students can recover the teacher's signal from pure noise if the auxiliary head is compatible; when the classification head is also compatible, student performance approaches, and in favorable regimes matches, that of the teacher.
Abstract
In the context of artificial neural networks, subliminal learning refers to the transfer of task-relevant knowledge or unintended biases from teacher to student models through distillation on task-unrelated input–output pairs. Prior explanations tie this effect to shared or closely matched teacher–student initialization. We show that a closely matched initialization is not necessary. Instead, subliminal learning is governed by compatible output heads. Using a controlled MNIST setting, we split outputs into an auxiliary head (for auxiliary, task-unrelated noise signals) and a class head (for classification) to demonstrate that subliminal learning occurs even when we randomly initialize hidden layers and remove layers, add new layers, or change the architecture (MLP-to-CNN). Compatible auxiliary heads enable transfer of a recoverable teacher signal, bringing the student's representations closer to the teacher's. When the class heads remain compatible as well, students trained only on task-unrelated noise can approach, and in favorable regimes match, teacher-level task performance. Our setting enables us to develop a theory that explains the mechanism of subliminal learning and to derive upper bounds on when subliminal learning fails. Together, our results turn subliminal learning from a surprising transfer effect into a theoretically grounded mechanism with predictable limits.
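The two-head setup described in the abstract can be illustrated with a toy NumPy sketch. It is not the authors' implementation, and several things in it are assumptions made purely for illustration: the teacher is a random (untrained) one-hidden-layer network rather than an MNIST classifier, "compatible heads" is modeled as the student reusing the teacher's head weights exactly, only the student's hidden layer is trained, and all names and sizes (`distill_on_noise`, `class_agreement`, `D_HID`, `N_AUX`, and so on) are hypothetical. The student's hidden layer is initialized independently of the teacher's and is trained only to match the teacher's auxiliary outputs on Gaussian noise.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative sizes (assumptions, not taken from the paper).
D_IN, D_HID, N_CLASS, N_AUX = 784, 64, 10, 3


def init_hidden(rng):
    """Draw a fresh hidden-layer weight matrix."""
    return rng.normal(0.0, 1.0 / np.sqrt(D_IN), size=(D_IN, D_HID))


def forward(x, w_hidden, w_class, w_aux):
    """One hidden tanh layer feeding a class head and an auxiliary head."""
    h = np.tanh(x @ w_hidden)
    return h, h @ w_class, h @ w_aux


# Teacher: random stand-in for a trained model (assumption).
teacher_hidden = init_hidden(rng)
w_class = rng.normal(0.0, 1.0 / np.sqrt(D_HID), size=(D_HID, N_CLASS))
w_aux = rng.normal(0.0, 1.0 / np.sqrt(D_HID), size=(D_HID, N_AUX))

# Student: independently initialized hidden layer, same heads as the teacher
# (our simplified reading of "compatible" heads).
student_hidden = init_hidden(rng)


def distill_on_noise(w_student, steps=300, batch=128, lr=0.05):
    """Fit the student's auxiliary outputs to the teacher's on pure noise."""
    losses = []
    for _ in range(steps):
        x = rng.normal(size=(batch, D_IN))  # task-unrelated inputs
        _, _, target_aux = forward(x, teacher_hidden, w_class, w_aux)
        h, _, pred_aux = forward(x, w_student, w_class, w_aux)
        err = pred_aux - target_aux
        losses.append(float(np.mean(err ** 2)))
        # Manual backprop of the MSE loss through the auxiliary head only.
        grad_aux = 2.0 * err / err.size
        grad_pre = (grad_aux @ w_aux.T) * (1.0 - h ** 2)
        w_student = w_student - lr * (x.T @ grad_pre)
    return w_student, losses


def class_agreement(w_student, n=2000):
    """Fraction of noise inputs where student and teacher class heads agree."""
    x = rng.normal(size=(n, D_IN))
    _, t_logits, _ = forward(x, teacher_hidden, w_class, w_aux)
    _, s_logits, _ = forward(x, w_student, w_class, w_aux)
    return float(np.mean(t_logits.argmax(axis=1) == s_logits.argmax(axis=1)))


if __name__ == "__main__":
    before = class_agreement(student_hidden)
    trained, losses = distill_on_noise(student_hidden)
    after = class_agreement(trained)
    print(f"aux loss: {losses[0]:.4f} -> {losses[-1]:.4f}")
    print(f"class-head agreement on noise: {before:.3f} -> {after:.3f}")
```

The class head never appears in the loss; `class_agreement` is only a measurement of whether anything carries over to it. How much carries over in this toy is not a result from the paper and will depend on the sizes and hyperparameters chosen here.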