On Minimax Estimation of Parameters in Softmax-Contaminated Mixture of Experts

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work studies the convergence rate of maximum likelihood estimation (MLE) for gating and prompting parameters in the softmax-contaminated mixture-of-experts (MoE) model—a framework widely adopted in large-model fine-tuning, where pretrained experts are frozen and newly introduced trainable prompts act as “contaminating” experts. Addressing the statistical identifiability challenge arising from functional overlap between prompts and pretrained knowledge, we introduce the notion of *distinguishability*, which formally characterizes how such overlap fundamentally impedes parameter estimation. Based on this, we derive minimax-optimal convergence rates under both distinguishable and indistinguishable regimes, proving that rates severely deteriorate in the latter. Our analysis integrates MLE theory, information-theoretic lower bounds, asymptotic inference, and nonconvex optimization techniques, supported by numerical experiments. This work provides the first systematic statistical foundation for prompt-based fine-tuning.
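To make the model concrete, here is a minimal sketch, not the authors' code, of the softmax-contaminated MoE as a two-component mixture: a sigmoid gate (the softmax over two experts) mixes a frozen pre-trained expert with a trainable prompt expert, and the gating and prompt parameters are fit by MLE. The Gaussian experts, the linear mean functions `h0` and `h_prompt`, and all parameter values are illustrative assumptions.

```python
# Minimal sketch of a softmax-contaminated MoE (illustrative assumptions:
# Gaussian experts with linear means; not the paper's exact setup).
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)

def h0(x):
    # Frozen pre-trained expert: a fixed mean function (assumed linear).
    return 1.5 * x

def h_prompt(x, eta):
    # Trainable prompt expert: mean function with prompt parameter eta.
    return eta * x

def neg_log_lik(theta, x, y, sigma=1.0):
    beta0, beta1, eta = theta
    # Sigmoid gate = softmax over the {pre-trained, prompt} pair of experts.
    gate = 1.0 / (1.0 + np.exp(-(beta0 + beta1 * x)))
    dens = (1.0 - gate) * norm.pdf(y, loc=h0(x), scale=sigma) \
         + gate * norm.pdf(y, loc=h_prompt(x, eta), scale=sigma)
    return -np.sum(np.log(dens + 1e-300))

# Simulate from the model with known parameters, then recover them by MLE.
n = 2000
x = rng.normal(size=n)
b0, b1, eta_true = -0.5, 1.0, -2.0
gate = 1.0 / (1.0 + np.exp(-(b0 + b1 * x)))
use_prompt = rng.random(n) < gate
y = np.where(use_prompt,
             rng.normal(h_prompt(x, eta_true), 1.0),
             rng.normal(h0(x), 1.0))

fit = minimize(neg_log_lik, x0=np.zeros(3), args=(x, y), method="Nelder-Mead")
print("MLE (beta0, beta1, eta):", fit.x)  # should land near (-0.5, 1.0, -2.0)
```

Here the prompt mean (slope -2.0) is well separated from the frozen expert (slope 1.5), so a distinguishability condition of the kind the paper formalizes plausibly holds and the MLE behaves well.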

📝 Abstract
The softmax-contaminated mixture of experts (MoE) model is deployed when a large-scale pre-trained model, which plays the role of a fixed expert, is fine-tuned for downstream tasks by including a new contamination part, or prompt, functioning as a new, trainable expert. Despite its popularity and relevance, the theoretical properties of the softmax-contaminated MoE have remained unexplored in the literature. In this paper, we study the convergence rates of the maximum likelihood estimator (MLE) of the gating and prompt parameters in order to gain insights into the statistical properties and potential challenges of fine-tuning with a new prompt. We find that the estimability of these parameters is compromised when the prompt acquires knowledge overlapping with the pre-trained model, in a sense we make precise by formulating a novel analytic notion of distinguishability. When the pre-trained and prompt models are distinguishable, we derive minimax-optimal estimation rates for all the gating and prompt parameters. By contrast, when the distinguishability condition is violated, these estimation rates become significantly slower because they depend on the rate at which the prompt converges to the pre-trained model. Finally, we empirically corroborate our theoretical findings through several numerical experiments.
Problem

Research questions and friction points this paper is trying to address.

Estimating gating and prompt parameters in softmax-contaminated MoE models
Analyzing convergence rates under distinguishability of pre-trained and prompt models
Investigating how estimation degrades when the prompt's knowledge overlaps with the pre-trained model's
Innovation

Methods, ideas, or system contributions that make the work stand out.

Softmax-contaminated MoE model as a statistical framework for prompt-based fine-tuning
Derivation of minimax-optimal estimation rates for the gating and prompt parameters
A distinguishability condition that governs estimation rates (illustrated in the sketch below)
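To illustrate the last point, the standalone sketch below (an assumed Gaussian setup, not from the paper) shows the identifiability failure behind the slow rates: when the prompt expert duplicates the frozen pre-trained expert, the likelihood is flat in the gating parameter, whereas a distinguishable prompt makes it vary.

```python
# Standalone illustration (assumed setup): loss of gate identifiability when
# the prompt expert coincides with the frozen pre-trained expert.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = rng.normal(size=5000)
y = rng.normal(1.5 * x, 1.0)  # data generated by the frozen expert alone

def nll(beta0, eta, sigma=1.0):
    gate = 1.0 / (1.0 + np.exp(-beta0))  # constant gate, for brevity
    dens = (1.0 - gate) * norm.pdf(y, 1.5 * x, sigma) \
         + gate * norm.pdf(y, eta * x, sigma)
    return -np.mean(np.log(dens))

# eta = 1.5 duplicates the frozen expert: nll is identical for every beta0,
# so the gating parameter cannot be estimated at any rate.
for b in (-2.0, 0.0, 2.0):
    print(f"eta= 1.5, beta0={b:+.1f}: nll={nll(b, 1.5):.6f}")

# eta = -2.0 is distinguishable from the frozen expert: nll now varies
# with beta0, so the gate is identifiable.
for b in (-2.0, 0.0, 2.0):
    print(f"eta=-2.0, beta0={b:+.1f}: nll={nll(b, -2.0):.6f}")
```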