What Matters More For In-Context Learning under Matched Compute Budgets: Pretraining on Natural Text or Incorporating Targeted Synthetic Examples?

📅 2025-09-26
📈 Citations: 0
Influential: 0
🤖 AI Summary
This study investigates whether incorporating targeted synthetic data, rather than purely natural text, during pretraining improves in-context learning (ICL) under a fixed compute budget (iso-FLOPs). The authors propose Bi-Induct, a lightweight curricular intervention that injects forward/backward copying patterns into the pretraining stream, and analyze its effect via head-level telemetry, induction-head ablation, and multi-dimensional ICL probing. Key findings: (1) Early activation of induction circuits does not guarantee ICL gains; what matters is whether those circuits evolve into functionally necessary, load-bearing structures. (2) Under natural-text pretraining, larger models (e.g., 1B parameters) spontaneously develop induction heads earlier, more broadly, and more centrally, outperforming synthetic-data-augmented models on ICL. (3) On function-style few-shot tasks, the 1B natural-text model achieves peak performance and is more sensitive to induction-head ablation, confirming that its induction circuits are more functionally integrated and load-bearing.

📝 Abstract
Does explicitly exercising the induction circuit during pretraining improve in-context learning (ICL), or is natural text sufficient when compute is held constant (iso-FLOPs)? To test whether targeted synthetic data can accelerate induction-head emergence and enhance ICL, we introduce Bi-Induct, a lightweight curriculum that injects forward-copy (Induction), backward-copy (Anti), or a balanced mix into the pretraining stream. We train models from 0.13B to 1B parameters under iso-FLOPs, evaluating (i) few-shot ICL benchmarks, (ii) head-level telemetry, and (iii) held-out language modeling perplexity. Our findings challenge the assumption that early induction-circuit activation directly improves ICL. While Bi-Induct accelerates induction-head emergence at small scales, this does not consistently yield stronger generalization. On standard LM benchmarks, Bi-Induct matches natural-only training; on function-style ICL probes, the 1B natural-only model performs best. Stress tests (e.g., label permutation, HITS@1 vs. HITS@3, 1 vs. 10 shots) preserve these trends. Telemetry shows that larger natural-only models develop broader, earlier induction heads without explicit induction patterns, while anti-induction data fails to elicit meaningful activation. Perplexity penalties from synthetic data shrink with scale, suggesting larger models can absorb non-natural patterns at minimal cost. Crucially, ablating the top 2% of induction heads degrades ICL more than random ablations, especially for natural-only models, indicating more centralized, load-bearing circuits. Bi-Induct variants exhibit more redundant induction activity, implying different circuit utilization. Overall, inducing activation is not sufficient: ICL gains depend on these circuits becoming functionally necessary. These results underscore the need for mechanism-aware pretraining diagnostics and for data mixtures that foster load-bearing, not merely present, structure.
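The copy patterns described above can be sketched in a few lines. The following is a minimal illustration, not the paper's implementation: the function names (`make_copy_example`, `biinduct_stream`), the exact interleaving schedule, and the prefix-then-copy sequence layout are all assumptions made for clarity.

```python
import random

def make_copy_example(vocab_size, prefix_len, mode="forward", seed=None):
    """Build one synthetic copy sequence: a random token prefix followed by
    its repeat (forward-copy / Induction) or its reverse (backward-copy / Anti)."""
    rng = random.Random(seed)
    prefix = [rng.randrange(vocab_size) for _ in range(prefix_len)]
    suffix = prefix if mode == "forward" else prefix[::-1]
    return prefix + suffix

def biinduct_stream(natural_batches, vocab_size, prefix_len, mix=0.5, seed=0):
    """Interleave synthetic copy examples into a stream of natural-text
    batches; `mix` is the probability a synthetic example is forward-copy
    (1.0 = Induction-only, 0.0 = Anti-only, 0.5 = balanced)."""
    rng = random.Random(seed)
    for batch in natural_batches:
        yield ("natural", batch)
        mode = "forward" if rng.random() < mix else "backward"
        yield ("synthetic",
               make_copy_example(vocab_size, prefix_len, mode,
                                 seed=rng.randrange(1 << 30)))
```

Under this sketch, a forward example like `[a, b, c, a, b, c]` rewards an induction head (attend to the token after the previous occurrence of the current token), while the backward variant `[a, b, c, c, b, a]` exercises the anti-induction pattern the abstract reports as failing to elicit meaningful activation.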
Problem

Research questions and friction points this paper is trying to address.

Comparing natural text versus synthetic data for in-context learning under equal compute budgets
Testing whether targeted synthetic examples accelerate induction-head emergence during pretraining
Evaluating whether early induction-circuit activation directly improves in-context learning generalization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Injecting targeted synthetic copy patterns (Bi-Induct) into the pretraining stream to exercise induction circuits
Evaluating models from 0.13B to 1B parameters under iso-FLOPs with head-level telemetry and ICL benchmarks
Showing that induction-circuit activation alone does not ensure ICL gains; circuits must become load-bearing
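The ablation comparison behind the "load-bearing" finding, removing the top 2% of heads by induction score versus an equally sized random set, can be sketched as follows. This is a simplified illustration under stated assumptions: `induction_scores` stands in for the paper's head-level telemetry, and the selection helper is hypothetical, not the authors' code.

```python
import random

def select_heads_to_ablate(induction_scores, frac=0.02, mode="top", seed=0):
    """Pick heads for zero-ablation: either the top-`frac` fraction by
    induction score (targeted ablation) or a random set of the same
    size (control). `induction_scores` maps head id -> telemetry score."""
    ranked = sorted(induction_scores, key=induction_scores.get, reverse=True)
    k = max(1, int(len(ranked) * frac))
    if mode == "top":
        return set(ranked[:k])
    return set(random.Random(seed).sample(ranked, k))
```

Comparing downstream ICL accuracy after ablating `mode="top"` versus `mode="random"` heads is the diagnostic: a large gap (as reported for the 1B natural-only model) indicates centralized, functionally necessary induction circuits, while a small gap (as in the Bi-Induct variants) indicates redundant induction activity.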