🤖 AI Summary
Baum–Welch learning for Hidden Markov Models (HMMs) suffers from local optima, while spectral methods often yield invalid parameter estimates. Method: We propose Belief Net, a structured neural network whose learnable weights are explicitly the logits of the initial distribution, transition matrix, and emission matrix, so that the HMM's forward-filtering process becomes an end-to-end differentiable, fully interpretable architecture. It adopts a decoder-only design with an autoregressive next-observation prediction loss, learning the Bayesian forward mechanism directly via gradient descent. Contributions/Results: On synthetic data, Belief Net converges faster and recovers the true parameters more accurately than baselines. On real-world language tasks, it significantly outperforms Transformer baselines, and it consistently surpasses spectral methods in both overcomplete and undercomplete settings. The approach combines theoretical soundness, rooted in exact Bayesian inference, with strong generalization capability.
📝 Abstract
Hidden Markov Models (HMMs) are fundamental for modeling sequential data, yet learning their parameters from observations remains challenging. Classical methods like the Baum-Welch (EM) algorithm are computationally intensive and prone to local optima, while modern spectral algorithms offer provable guarantees but may produce probability outputs outside valid ranges. This work introduces Belief Net, a novel framework that learns HMM parameters through gradient-based optimization by formulating the HMM's forward filter as a structured neural network. Unlike black-box Transformer models, Belief Net's learnable weights are explicitly the logits of the initial distribution, transition matrix, and emission matrix, ensuring full interpretability. The model processes observation sequences using a decoder-only architecture and is trained end-to-end with standard autoregressive next-observation prediction loss. On synthetic HMM data, Belief Net achieves superior convergence speed compared to Baum-Welch, successfully recovering parameters in both undercomplete and overcomplete settings where spectral methods fail. Comparisons with Transformer-based models are also presented on real-world language data.
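To make the mechanism concrete, here is a minimal pure-Python sketch of the idea the abstract describes: the only parameters are the logits of the initial distribution, transition matrix, and emission matrix, and the forward filter produces the autoregressive next-observation prediction loss. This is an illustrative reconstruction, not the authors' implementation; all function names and argument shapes are assumptions, and in Belief Net itself these logits would be trainable tensors differentiated by an autodiff framework rather than plain Python lists.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def forward_filter_nll(pi_logits, A_logits, B_logits, obs):
    """Run Bayesian forward filtering and return the summed negative
    log-likelihood of predicting each next observation from the belief.

    pi_logits : list[K]     -- logits of the initial state distribution
    A_logits  : list[K][K]  -- transition logits, row i -> P(next state | i)
    B_logits  : list[K][V]  -- emission logits, row i -> P(observation | i)
    obs       : list[int]   -- observation indices in [0, V)
    """
    pi = softmax(pi_logits)
    A = [softmax(row) for row in A_logits]
    B = [softmax(row) for row in B_logits]
    K, V = len(pi), len(B[0])

    belief = pi[:]          # current belief over hidden states
    nll = 0.0
    for o in obs:
        # Predictive distribution over the next observation.
        p_obs = [sum(belief[i] * B[i][v] for i in range(K)) for v in range(V)]
        nll -= math.log(p_obs[o])
        # Bayes update on the emitted symbol, then propagate through A.
        post = [belief[i] * B[i][o] for i in range(K)]
        z = sum(post)
        post = [p / z for p in post]
        belief = [sum(post[i] * A[i][j] for i in range(K)) for j in range(K)]
    return nll
```

Because every step (softmax, belief update, log-loss) is differentiable in the parameters, rewriting this loop with e.g. PyTorch tensors lets gradient descent on the logits recover valid (properly normalized) HMM parameters, which is the contrast the abstract draws with spectral methods.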