Induced Model Matching: Restricted Models Help Train Full-Featured Models

📅 2024-02-19
🏛️ Neural Information Processing Systems
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the challenge of leveraging high-accuracy, feature-constrained small models—e.g., bigram-only logistic regression—to guide the training of full-feature large models (e.g., LSTM or Transformer). To this end, we propose Induced Model Matching (IMM): a novel framework that constructs a context-induced distribution alignment loss to enforce output consistency between the large model and the small model within the restricted feature subspace. We formally characterize the “induced matching” principle for the first time, showing that noise injection and reverse knowledge distillation are inconsistent approximations of IMM. Crucially, IMM transforms the small model’s structural constraints into structured supervision signals, thereby improving generalization and decision consistency. We provide theoretical guarantees establishing the statistical consistency of IMM. Empirically, IMM significantly enhances large-model performance on language modeling and POMDP-to-MDP policy transfer tasks.

Technology Category

Application Category

📝 Abstract
We consider scenarios where a very accurate (often small) predictive model using restricted features is available when training a full-featured (often larger) model. This restricted model may be thought of as side-information'', and can come either from an auxiliary dataset or from the same dataset by forcing the restriction. How can the restricted model be useful to the full model? To answer this, we introduce a methodology called Induced Model Matching (IMM). IMM aligns the context-restricted, or induced, version of the large model with the restricted model. We relate IMM to approaches such as noising, which is implicit in addressing the problem, and reverse knowledge distillation from weak teachers, which is explicit but does not exploit restriction being the nature of the weakness. We show that these prior methods can be thought of as approximations to IMM and can be problematic in terms of consistency. Experimentally, we first motivate IMM using logistic regression as a toy example. We then explore it in language modeling, the application that initially inspired it, and demonstrate it on both LSTM and transformer full models, using bigrams as restricted models. We lastly give a simple RL example, which shows that POMDP policies can help learn better MDP policies. The IMM principle is thus generally applicable in common scenarios where restricted data is cheaper to collect or restricted models are easier to learn.
Problem

Research questions and friction points this paper is trying to address.

Utilizing restricted models to enhance full-featured model training
Aligning induced large models with restricted models via IMM
Addressing consistency issues in prior weak teacher distillation methods
Innovation

Methods, ideas, or system contributions that make the work stand out.

Induced Model Matching aligns restricted and full models
Uses restricted models as side-information for training
Applicable in language modeling and RL scenarios
🔎 Similar Papers
No similar papers found.