🤖 AI Summary
This study investigates whether incorporating targeted synthetic data, rather than purely natural text, during pretraining improves in-context learning (ICL) under a fixed computational budget (iso-FLOPs). The authors propose Bi-Induct, a lightweight curricular intervention that injects forward/backward copying patterns into the pretraining stream, and conduct a systematic analysis via head-level telemetry, induction-head ablation, and multi-dimensional ICL probing. Key findings: (1) Early activation of induction circuits does not guarantee ICL gains; what matters is whether they evolve into functionally necessary, load-bearing structures. (2) Under natural-text pretraining, larger models (e.g., 1B parameters) spontaneously develop induction heads earlier, more broadly, and more centrally, outperforming synthetic-data-augmented models on ICL. (3) On function-style few-shot tasks, the 1B natural-text model achieves peak performance and exhibits greater sensitivity to induction-head ablation, confirming that its induction circuits are more functionally integrated and load-bearing.
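The forward/backward copy injection can be illustrated with a minimal sketch. All details here are assumptions for illustration (the summary does not specify the exact sequence format): a forward-copy (Induction) example repeats a random token span verbatim, so predicting the second half rewards induction-style lookup, while a backward-copy (Anti) example appends the span reversed.

```python
import random

def make_copy_sequence(vocab_size, span_len, mode="forward", seed=None):
    """Hypothetical Bi-Induct-style synthetic example: a random token span
    followed by its copy (forward / Induction) or its reverse (backward / Anti)."""
    rng = random.Random(seed)
    span = [rng.randrange(vocab_size) for _ in range(span_len)]
    if mode == "forward":
        # Induction pattern: [A B C | A B C] — second half is an exact copy
        return span + span
    if mode == "backward":
        # Anti pattern: [A B C | C B A] — second half is the span reversed
        return span + span[::-1]
    raise ValueError(f"unknown mode: {mode}")

# Example: a 10-token forward-copy sequence over a 100-token vocabulary
seq = make_copy_sequence(vocab_size=100, span_len=5, mode="forward", seed=0)
```

Such sequences would then be mixed into the natural-text pretraining stream at some ratio chosen to keep total FLOPs constant.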
📝 Abstract
Does explicitly exercising the induction circuit during pretraining improve in-context learning (ICL), or is natural text sufficient when compute is held constant (iso-FLOPs)? To test whether targeted synthetic data can accelerate induction-head emergence and enhance ICL, we introduce Bi-Induct, a lightweight curriculum that injects forward-copy (Induction), backward-copy (Anti), or a balanced mix into the pretraining stream. We train models from 0.13B to 1B parameters under iso-FLOPs, evaluating (i) few-shot ICL benchmarks, (ii) head-level telemetry, and (iii) held-out language modeling perplexity. Our findings challenge the assumption that early induction-circuit activation directly improves ICL. While Bi-Induct accelerates induction-head emergence at small scales, this does not consistently yield stronger generalization. On standard LM benchmarks, Bi-Induct matches natural-only training; on function-style ICL probes, the 1B natural-only model performs best. Stress tests (e.g., label permutation, HITS@1 vs. HITS@3, 1 vs. 10 shots) preserve these trends. Telemetry shows that larger natural-only models develop broader, earlier-emerging induction heads without explicit induction patterns. Anti-induction data fails to elicit meaningful activation. Perplexity penalties from synthetic data shrink with scale, suggesting larger models can absorb non-natural patterns at minimal cost. Crucially, ablating the top 2% of induction heads degrades ICL more than random ablations do, especially for natural-only models, indicating more centralized, load-bearing circuits. Bi-Induct variants exhibit more redundant induction activity, implying different circuit utilization. Overall, inducing activation is not sufficient: ICL gains depend on these circuits becoming functionally necessary. These results underscore the value of mechanism-aware pretraining diagnostics and of data mixtures that foster load-bearing, not merely present, structure.
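The head-ablation comparison (top 2% by induction score vs. a random 2%) can be sketched as follows. This is a simplified illustration under assumed details (per-head induction scores arranged as a layers × heads matrix; ablation implemented by zeroing a head's output via a 0/1 mask); the paper's exact procedure may differ.

```python
import numpy as np

def ablation_mask(head_scores, frac=0.02, mode="top", seed=0):
    """Build a 0/1 mask over attention heads for an ablation experiment.

    head_scores: array of shape (layers, heads) with per-head induction scores.
    mode="top" zeroes the top `frac` fraction of heads by score (the
    load-bearing candidates); mode="random" zeroes an equally sized random
    subset as a control. A 0 entry means the head's output is zeroed out.
    """
    flat = head_scores.ravel()
    k = max(1, int(round(frac * flat.size)))  # number of heads to ablate
    if mode == "top":
        idx = np.argsort(flat)[-k:]           # indices of highest-scoring heads
    else:
        idx = np.random.default_rng(seed).choice(flat.size, size=k, replace=False)
    mask = np.ones(flat.size)
    mask[idx] = 0.0
    return mask.reshape(head_scores.shape)
```

Comparing ICL accuracy under the "top" mask against the "random" mask then reveals how concentrated (load-bearing) the induction circuitry is: a large gap, as reported for the natural-only models, indicates the top-scoring heads are functionally necessary rather than redundant.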