🤖 AI Summary
This study investigates why probabilistic circuits (PCs) underperform large Transformer-based language models in autoregressive language modeling. Using a unified autoregressive framework, it compares the two model classes along two axes: output parameterization and context encoding. On the output side, PCs typically parameterize predictions in probability space, whereas Transformers use logit space; adopting a logit-space parameterization in PCs substantially narrows the performance gap. On the context side, theoretical and empirical analysis shows that structured-decomposable PCs can match the separation rank of Transformers only on partitions aligned with their vtree, so their fixed routing architectures are highly sensitive to dependency topology and struggle to capture heterogeneous long-range dependencies. Decomposable PCs are strictly more expressive than structured-decomposable ones, but optimizing them effectively remains an open problem. This work is the first to introduce separation rank theory into the analysis of autoregressive PCs, clarifying the roles of parameterization schemes and structural constraints in modeling capacity.
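The contrast between probability-space and logit-space parameterization can be illustrated with a small numerical sketch. This is not the paper's model: the two component distributions, the equal weights, and the choice of "sum of log-probabilities followed by renormalization" as the logit-space combination are all assumptions made for illustration. The point is only that a convex combination can never be sharper than its sharpest component, while a logit-space combination can.

```python
import numpy as np

def softmax(z):
    """Numerically stable softmax over a 1-D array of logits."""
    z = z - np.max(z)
    e = np.exp(z)
    return e / e.sum()

def entropy(p):
    """Shannon entropy in nats; clipping avoids log(0)."""
    p = np.clip(p, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum())

# Two hypothetical next-token distributions over a 4-token vocabulary.
p1 = np.array([0.6, 0.30, 0.05, 0.05])
p2 = np.array([0.6, 0.05, 0.30, 0.05])
w = np.array([0.5, 0.5])  # assumed mixture weights

# Probability-space combination: a convex mixture, as in a PC sum node.
mix = w[0] * p1 + w[1] * p2

# Logit-space combination (illustrative assumption): add the components'
# log-probabilities as logits, then renormalize with a softmax.
logit = softmax(np.log(p1) + np.log(p2))

print("mixture     :", mix.round(3), "entropy", round(entropy(mix), 3))
print("logit-space :", logit.round(3), "entropy", round(entropy(logit), 3))
```

In this toy setting the mixture's peak probability stays at 0.6, the peak of its components, whereas the logit-space combination concentrates roughly 0.92 of the mass on the shared top token.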
📝 Abstract
Probabilistic Circuits (PCs) are deep generative models that support exact and efficient probabilistic inference. Yet in autoregressive language modeling, PCs still lag behind Transformer-based large language models (LLMs), suggesting an important expressivity gap. In this work, we compare PCs and LLMs under a unified autoregressive formulation and identify two bottlenecks. The first is an output bottleneck: PCs parameterize predictions as convex combinations in probability space, which struggle to represent the sharp distributions typical of language; adopting a logit-space parameterization substantially narrows this gap. The second is a context-encoding bottleneck: we prove that structured-decomposable PCs can match the separation rank of Transformers on vtree-aligned partitions, but show, both theoretically and empirically, that this capacity is limited to partitions aligned with the fixed routing structure, leading to severe degradation when the data exhibits heterogeneous dependency topologies. We further prove that decomposable PCs are strictly more expressive than structured-decomposable ones, though effectively optimizing them remains an open challenge.
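For readers unfamiliar with the term, separation rank is commonly defined as follows. The notation here ($f$, $A$, $B$, $g_r$, $h_r$, $R$) is an assumption taken from the general literature on separation rank, and the paper's own notation may differ. Given a function $f$ of variables $x_1, \dots, x_n$ and a partition $(A, B)$ of the variable indices, the separation rank is the smallest number of product terms needed to write $f$ as a sum of functions that each factor across the partition:

```latex
\operatorname{sep}_{(A,B)}(f)
  \;=\;
  \min\Bigl\{\, R \in \mathbb{N} \;:\;
      f(x_1,\dots,x_n) \;=\; \sum_{r=1}^{R} g_r(x_A)\, h_r(x_B)
  \Bigr\}
```

A separation rank of 1 means $f$ factorizes completely across $(A, B)$, so the model captures no interaction between the two groups of variables; higher values indicate a greater capacity to model dependencies across that particular partition. Under this reading, the abstract's claim is that a structured-decomposable PC attains high separation rank only for partitions that agree with its vtree.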