Dual Process Learning: Controlling Use of In-Context vs. In-Weights Strategies with Weight Forgetting

πŸ“… 2024-05-28
πŸ›οΈ arXiv.org
πŸ“ˆ Citations: 4
✨ Influential: 0
πŸ“„ PDF
πŸ€– AI Summary
Language models struggle to simultaneously support in-context learning (ICL) and in-weights learning (IWL) for out-of-vocabulary tokens, particularly lacking structural ICL—the ability to generalize compositional, context-sensitive representations for arbitrary novel words. Method: We formally define and systematically investigate structural ICL, proposing a tunable dual-path mechanism that jointly models contextual and parametric learning within a single model. Building upon Chen et al. (2024)'s active forgetting framework, we introduce a joint pretraining-finetuning paradigm, validated via synthetic tasks, masked language modeling (MLM), and autoregressive modeling. Contribution/Results: We discover that structural ICL emerges transiently early in pretraining but rapidly decays. Our method not only restores this capability but also enables continuous, task-aware trade-off control between ICL and IWL preferences. Empirically, it achieves synergistic generalization across diverse tasks, demonstrating the first unified approach to co-optimizing structural ICL and IWL.
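The ICL/IWL distinction above can be made concrete with a toy contrast (illustrative function and token names, not the paper's code): an in-weights learner answers from information memorized in its parameters, so it fails on a token it has never seen, while a structural in-context learner reads the answer off the (token, label) pairing pattern in the prompt itself, so the token's identity does not matter.

```python
def in_weights_predict(token, memory):
    # IWL: the answer comes from stored parameters (here, a memorized dict),
    # so an out-of-vocabulary token yields nothing.
    return memory.get(token)

def structural_icl_predict(context, query):
    # Structural ICL: the answer comes from the alternating (token, label)
    # structure of the context, regardless of what the tokens are.
    pairs = dict(zip(context[0::2], context[1::2]))
    return pairs.get(query)

# "blicket" and "florp" stand in for arbitrary novel tokens.
memory = {"cat": "noun", "run": "verb"}
context = ["blicket", "noun", "florp", "verb"]

iwl_on_novel = in_weights_predict("blicket", memory)        # None: never memorized
icl_on_novel = structural_icl_predict(context, "blicket")   # "noun": read off the context
```

The point of the toy is only that the two strategies fail in different places: the in-weights lookup generalizes across contexts but not to novel tokens, while the structural strategy generalizes to novel tokens but needs the pattern present in context.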

πŸ“ Abstract
Language models have the ability to perform in-context learning (ICL), allowing them to flexibly adapt their behavior based on context. This contrasts with in-weights learning (IWL), where memorized information is encoded in model parameters after iterated observations of data. An ideal model should be able to flexibly deploy both of these abilities. Despite their apparent ability to learn in-context, language models are known to struggle when faced with unseen or rarely seen tokens (Land & Bartolo, 2024). Hence, we study $\textbf{structural in-context learning}$, which we define as the ability of a model to execute in-context learning on arbitrary novel tokens -- so called because the model must generalize on the basis of e.g. sentence structure or task structure, rather than content encoded in token embeddings. We study structural in-context algorithms on both synthetic and naturalistic tasks using toy models, masked language models, and autoregressive language models. We find that structural ICL appears before quickly disappearing early in LM pretraining. While it has been shown that ICL can diminish during training (Singh et al., 2023), we find that prior work does not account for structural ICL. Building on Chen et al. (2024)'s active forgetting method, we introduce pretraining and finetuning methods that can modulate the preference for structural ICL and IWL. Importantly, this allows us to induce a $\textit{dual process strategy}$ where in-context and in-weights solutions coexist within a single model.
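The active forgetting recipe the abstract builds on (Chen et al., 2024) amounts to periodically re-initializing the token embeddings during pretraining while the rest of the network keeps its learned weights. A minimal sketch of that training loop, with a numpy stand-in for the model (the "body" matrix, step sizes, and reset interval are all illustrative assumptions, not the paper's hyperparameters):

```python
import numpy as np

def train_with_active_forgetting(n_steps, reset_every, vocab=8, dim=4, seed=0):
    """Toy active-forgetting loop: embeddings are periodically forgotten,
    the non-embedding weights are never reset."""
    rng = np.random.default_rng(seed)
    emb = rng.normal(size=(vocab, dim))   # token embeddings (reset periodically)
    body = np.zeros((dim, dim))           # stand-in for transformer weights (kept)
    resets = 0
    for step in range(1, n_steps + 1):
        # stand-in "gradient updates": both parts drift a little each step
        emb += 0.01 * rng.normal(size=emb.shape)
        body += 0.01 * np.eye(dim)
        if step % reset_every == 0:
            emb = rng.normal(size=(vocab, dim))   # forget: fresh embeddings
            resets += 1
    return resets, body

resets, body = train_with_active_forgetting(n_steps=100, reset_every=25)
```

Because the embeddings are repeatedly thrown away, the body cannot rely on token-identity content surviving in them, which is the pressure toward structure-based (in-context) solutions that the paper's pretraining and finetuning methods then modulate.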
Problem

Research questions and friction points this paper is trying to address.

Enhancing language models' ability to generalize on novel tokens.
Balancing in-context and in-weights learning strategies effectively.
Preventing early disappearance of structural in-context learning during training.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces structural in-context learning for novel tokens
Modulates preference for ICL and IWL using forgetting methods
Enables dual process strategy in a single model