🤖 AI Summary
Traditional embedding methods for high-cardinality categorical data suffer from excessive parameter counts, which scale linearly with dictionary size, and thus incur high computational cost and overfitting risk. To address this, we propose MMbeddings, a probabilistic embedding framework based on variational autoencoders that introduces nonlinear mixed models into deep embedding learning for the first time. It treats the embedding of each category as a stochastic latent variable (a random effect) drawn from a learnable distribution and uses a shared encoder to produce low-dimensional embeddings. By replacing fixed embeddings with random effects, MMbeddings reduces parameter count by over 90% while significantly improving generalization. Extensive experiments on collaborative filtering and tabular regression tasks demonstrate that MMbeddings consistently outperforms state-of-the-art baselines across multiple real-world and synthetic datasets, achieving substantial reductions in memory footprint and training cost.
📝 Abstract
We present MMbeddings, a probabilistic embedding approach that reinterprets categorical embeddings through the lens of nonlinear mixed models, effectively bridging classical statistical theory with modern deep learning. By treating embeddings as latent random effects within a variational autoencoder framework, our method substantially decreases the number of parameters -- from the conventional embedding approach of cardinality $\times$ embedding dimension, which quickly becomes infeasible with large cardinalities, to a significantly smaller, cardinality-independent number determined primarily by the encoder architecture. This reduction dramatically mitigates overfitting and computational burden in high-cardinality settings. Extensive experiments on simulated and real datasets, encompassing collaborative filtering and tabular regression tasks using varied architectures, demonstrate that MMbeddings consistently outperforms traditional embeddings, underscoring its potential across diverse machine learning applications.
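
To make the parameter-count argument concrete, the following is a minimal arithmetic sketch, not the authors' implementation. It compares a conventional lookup table (cardinality $\times$ embedding dimension) with a small fully connected encoder whose size does not depend on cardinality. The encoder layout here is purely hypothetical: an input width of 64, one hidden layer of 128 units, and an output of twice the embedding dimension (a mean and a log-variance per dimension, as is common in variational autoencoders). The abstract does not specify the actual architecture.

```python
from typing import Sequence


def embedding_table_params(cardinality: int, dim: int) -> int:
    """Parameters in a conventional lookup table: one dim-sized vector per category."""
    return cardinality * dim


def mlp_params(layer_sizes: Sequence[int]) -> int:
    """Weights plus biases of a fully connected network with the given layer widths."""
    return sum(n_in * n_out + n_out
               for n_in, n_out in zip(layer_sizes, layer_sizes[1:]))


# Illustrative values only; none of these numbers come from the paper.
cardinality = 1_000_000   # number of distinct categories
dim = 32                  # embedding dimension

table = embedding_table_params(cardinality, dim)

# Hypothetical encoder: 64 inputs -> 128 hidden -> 2 * dim outputs
# (mean and log-variance for each embedding dimension).
encoder = mlp_params([64, 128, 2 * dim])

print(f"lookup table: {table:,} parameters")
print(f"encoder:      {encoder:,} parameters")

# Growing the cardinality changes the table size but not the encoder size.
bigger_table = embedding_table_params(10 * cardinality, dim)
print(f"table at 10x cardinality: {bigger_table:,} parameters")
```

Under these assumed sizes the lookup table holds 32,000,000 parameters and the encoder 16,576. Multiplying the cardinality by ten multiplies the table size by ten and leaves the encoder count unchanged, which is the sense in which the abstract calls the parameter count cardinality-independent.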