Fast Model Selection and Stable Optimization for Softmax-Gated Multinomial-Logistic Mixture of Experts Models

📅 2026-02-08
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses two gaps for softmax-gated mixture-of-experts (MoE) classifiers: limited stability guarantees for maximum-likelihood training and limited principled model selection. The paper proposes a batch minorization-maximization (MM) algorithm built on an explicit quadratic minorizer, which yields closed-form coordinate-wise updates with monotone ascent of the objective. It also introduces a sweep-free selector for the number of experts, based on dendrograms of mixing measures, and proves finite-sample rates for conditional density estimation and parameter recovery; after redundant fitted atoms are merged, the selector achieves near-parametric optimal rates. Experiments on protein–protein interaction prediction show improved accuracy and better-calibrated probabilities compared with strong statistical and machine-learning baselines.
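
To make the model class concrete, the following is a minimal sketch of how a softmax-gated multinomial-logistic MoE turns an input into class probabilities: a softmax gate weights K experts, and each expert is itself a multinomial-logistic (softmax) classifier. The function name `moe_class_probs`, the parameter names, and the array shapes are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    """Numerically stable softmax along the given axis."""
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def moe_class_probs(X, gate_W, gate_b, expert_W, expert_b):
    """Class probabilities of a softmax-gated multinomial-logistic MoE.

    Shapes (hypothetical layout chosen for this sketch):
      X: (n, d), gate_W: (d, K), gate_b: (K,),
      expert_W: (K, d, C), expert_b: (K, C).
    """
    gate = softmax(X @ gate_W + gate_b)                         # (n, K) gate weights
    logits = np.einsum("nd,kdc->nkc", X, expert_W) + expert_b   # (n, K, C) expert logits
    expert = softmax(logits, axis=-1)                           # (n, K, C) per-expert class probs
    return np.einsum("nk,nkc->nc", gate, expert)                # (n, C) mixture over experts

# Small random example with illustrative sizes.
rng = np.random.default_rng(0)
n, d, K, C = 5, 3, 2, 4
X = rng.normal(size=(n, d))
probs = moe_class_probs(
    X,
    rng.normal(size=(d, K)), rng.normal(size=K),
    rng.normal(size=(K, d, C)), rng.normal(size=(K, C)),
)
```

Because the gate weights and each expert's outputs are both probability vectors, every row of `probs` is a convex combination of probability vectors and sums to one.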


📝 Abstract
Mixture-of-Experts (MoE) architectures combine specialized predictors through a learned gate and are effective across regression and classification, but for classification with softmax multinomial-logistic gating, rigorous guarantees for stable maximum-likelihood training and principled model selection remain limited. We address both issues in the full-data (batch) regime. First, we derive a batch minorization-maximization (MM) algorithm for softmax-gated multinomial-logistic MoE using an explicit quadratic minorizer, yielding coordinate-wise closed-form updates that guarantee monotone ascent of the objective and global convergence to a stationary point (in the standard MM sense), avoiding approximate M-steps common in EM-type implementations. Second, we prove finite-sample rates for conditional density estimation and parameter recovery, and we adapt dendrograms of mixing measures to the classification setting to obtain a sweep-free selector of the number of experts that achieves near-parametric optimal rates after merging redundant fitted atoms. Experiments on biological protein–protein interaction prediction validate the full pipeline, delivering improved accuracy and better-calibrated probabilities than strong statistical and machine-learning baselines.
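
The monotone-ascent property of a quadratic minorizer can be illustrated on a simpler problem. The sketch below is an assumption-laden stand-in, not the paper's algorithm: it fits a plain multinomial-logistic regression (no gate, a single expert) and uses Böhning's classical curvature bound, under which the softmax Hessian block diag(p) - pp^T is dominated by (1/2)(I - 11^T/C). Maximizing the resulting quadratic surrogate gives a closed-form block update, whereas the paper describes coordinate-wise updates for the full MoE. The names `mm_step` and `log_likelihood` and the small `ridge` term are hypothetical choices for this example.

```python
import numpy as np

def log_likelihood(W, X, Y):
    """Multinomial-logistic log-likelihood; Y is one-hot with shape (n, C)."""
    Z = X @ W
    m = Z.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(Z - m).sum(axis=1))  # stable log-sum-exp
    return float((Y * Z).sum() - lse.sum())

def mm_step(W, X, Y, ridge=1e-8):
    """One MM update from a quadratic minorizer (Bohning bound, assumed here).

    The surrogate's curvature is fixed, so its maximizer is available in
    closed form: W + 2 * (X^T X)^{-1} X^T (Y - P).
    The tiny ridge only guards against a singular X^T X; it increases the
    surrogate's curvature, so the surrogate remains a valid minorizer.
    """
    Z = X @ W
    Z = Z - Z.max(axis=1, keepdims=True)
    P = np.exp(Z)
    P /= P.sum(axis=1, keepdims=True)        # current class probabilities
    G = X.T @ (Y - P)                        # gradient of the log-likelihood
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return W + 2.0 * np.linalg.solve(A, G)

# Synthetic data with illustrative sizes.
rng = np.random.default_rng(0)
n, d, C = 200, 3, 4
X = rng.normal(size=(n, d))
Y = np.eye(C)[rng.integers(0, C, size=n)]    # one-hot labels

W = np.zeros((d, C))
history = [log_likelihood(W, X, Y)]
for _ in range(20):
    W = mm_step(W, X, Y)
    history.append(log_likelihood(W, X, Y))
```

Each entry of `history` should be no smaller than the previous one (up to floating-point rounding), which is the ascent behavior the abstract attributes to MM updates; no step size or line search is needed because the surrogate lies below the objective everywhere and touches it at the current iterate.
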
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
softmax gating
model selection
stable optimization
multinomial logistic classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
minorization-maximization
softmax gating
model selection
finite-sample guarantees