Fast Model Selection and Stable Optimization for Softmax-Gated Multinomial-Logistic Mixture of Experts Models

๐Ÿ“… 2026-02-08
๐Ÿ“ˆ Citations: 0
โœจ Influential: 0
๐Ÿ“„ PDF
๐Ÿค– AI Summary
This work addresses the lack of stability guarantees in maximum-likelihood training and the absence of principled model selection for softmax-gated mixture-of-experts (MoE) models. To this end, we propose a batch minorization-maximization (MM) algorithm based on an explicit quadratic minorizing function, which yields closed-form coordinate-wise updates. Furthermore, we introduce a sweep-free mechanism for selecting the number of experts, driven by a dendrogram of mixing measures. Our approach provides, for the first time, a stable optimization framework for softmax-gated MoE with finite-sample theoretical guarantees, achieving near-parametric optimal convergence rates. Experimental results on protein–protein interaction prediction demonstrate that the proposed method significantly outperforms strong baselines in both predictive accuracy and probability calibration.
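To make the "explicit quadratic minorizer" idea concrete: for multinomial-logistic (softmax) likelihoods, the per-example Hessian diag(p) − p pᵀ is dominated by (1/2)I, so a fixed curvature matrix B = (1/2)XᵀX upper-bounds the negative log-likelihood curvature for every class; each MM step is then a single precomputed linear solve that provably never decreases the log-likelihood. This is a minimal sketch in the spirit of Böhning-type bounds, not the paper's actual algorithm (which handles the full gated MoE model, not a plain multinomial-logistic fit):

```python
import numpy as np

def softmax(Z):
    Z = Z - Z.max(axis=1, keepdims=True)  # shift for numerical stability
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def mm_softmax_update(X, Y, n_iter=300):
    """MM iterations for multinomial-logistic weights.

    The softmax Hessian diag(p) - p p^T satisfies diag(p) - p p^T <= (1/2) I,
    so B = (1/2) X^T X dominates the per-class curvature everywhere. Using
    this FIXED matrix as the quadratic minorizer's curvature gives a
    closed-form update per step and monotone ascent of the log-likelihood.
    X: (n, d) design matrix; Y: (n, K) one-hot labels. Returns W: (d, K).
    """
    n, d = X.shape
    K = Y.shape[1]
    W = np.zeros((d, K))
    # Fixed minorizer curvature: precompute its inverse once, reuse every step.
    B_inv = np.linalg.inv(0.5 * (X.T @ X) + 1e-8 * np.eye(d))
    for _ in range(n_iter):
        P = softmax(X @ W)
        W = W + B_inv @ (X.T @ (Y - P))  # MM step: gradient / fixed curvature
    return W
```

Because B does not depend on the current iterate, the expensive factorization happens once, which is what makes the coordinate-wise closed-form updates in the paper cheap per iteration.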

๐Ÿ“ Abstract
Mixture-of-Experts (MoE) architectures combine specialized predictors through a learned gate and are effective across regression and classification, but for classification with softmax multinomial-logistic gating, rigorous guarantees for stable maximum-likelihood training and principled model selection remain limited. We address both issues in the full-data (batch) regime. First, we derive a batch minorization-maximization (MM) algorithm for softmax-gated multinomial-logistic MoE using an explicit quadratic minorizer, yielding coordinate-wise closed-form updates that guarantee monotone ascent of the objective and global convergence to a stationary point (in the standard MM sense), avoiding approximate M-steps common in EM-type implementations. Second, we prove finite-sample rates for conditional density estimation and parameter recovery, and we adapt dendrograms of mixing measures to the classification setting to obtain a sweep-free selector of the number of experts that achieves near-parametric optimal rates after merging redundant fitted atoms. Experiments on biological protein–protein interaction prediction validate the full pipeline, delivering improved accuracy and better-calibrated probabilities than strong statistical and machine-learning baselines.
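The "sweep-free selector" can be pictured as follows: rather than refitting the model for each candidate number of experts, one overfits once, then agglomeratively merges the closest fitted parameter atoms and reads the expert count off the merge heights. The sketch below is a heavily simplified, hypothetical version of that idea (Euclidean distances, midpoint merges, and a largest-relative-gap rule are all assumptions; the paper's dendrogram of mixing measures uses its own construction and theory):

```python
import numpy as np

def merge_heights(atoms):
    """Greedy agglomerative merging of fitted expert parameters ('atoms').

    Repeatedly merges the closest pair (Euclidean distance) into its
    midpoint and records each merge distance. Redundant near-duplicate
    atoms collapse at small heights; merges across genuinely distinct
    experts show up as a jump in height.
    """
    atoms = [np.asarray(a, dtype=float) for a in atoms]
    heights = []
    while len(atoms) > 1:
        best = (np.inf, 0, 1)
        for i in range(len(atoms)):
            for j in range(i + 1, len(atoms)):
                d = np.linalg.norm(atoms[i] - atoms[j])
                if d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        heights.append(d)
        merged = 0.5 * (atoms[i] + atoms[j])
        atoms = [a for k, a in enumerate(atoms) if k not in (i, j)] + [merged]
    return heights

def select_num_experts(atoms):
    """Pick the expert count at the largest relative gap in merge heights.

    With M atoms there are M-1 merges; if the jump occurs between heights
    h[g] and h[g+1], then the first g+1 merges removed redundancy and
    M - g - 1 well-separated experts remain.
    """
    M = len(atoms)
    if M <= 2:
        return M
    h = merge_heights(atoms)
    gaps = [h[k + 1] / max(h[k], 1e-12) for k in range(len(h) - 1)]
    g = int(np.argmax(gaps))
    return M - g - 1
```

For example, six fitted atoms forming three tight near-duplicate pairs yield three small merge heights followed by a sharp jump, so the selector returns three experts without any sweep over candidate model sizes.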
Problem

Research questions and friction points this paper is trying to address.

Mixture-of-Experts
softmax gating
model selection
stable optimization
multinomial logistic classification
Innovation

Methods, ideas, or system contributions that make the work stand out.

Mixture-of-Experts
minorization-maximization
softmax gating
model selection
finite-sample guarantees
๐Ÿ”Ž Similar Papers
No similar papers found.