AI Summary
To address the challenges of efficiency, accuracy, and robustness to missing data in Bayesian clustering for large-scale datasets, this paper proposes Cluster-PFN, the first method extending Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Built on a Transformer architecture, Cluster-PFN is trained on synthetic data generated from a finite Gaussian mixture model prior, which lets it jointly infer the number of clusters and the cluster assignments end to end. It handles missing data natively, without imputation or manual model selection. Experiments show that Cluster-PFN estimates the number of clusters more accurately than classical criteria (AIC, BIC) and variational inference, achieves clustering quality competitive with variational inference while running orders of magnitude faster, and outperforms imputation-based baselines on real-world genomic data with high missingness.
Abstract
Bayesian clustering accounts for uncertainty but is computationally demanding at scale. Furthermore, real-world datasets often contain missing values, and simple imputation ignores the associated uncertainty, leading to suboptimal results. We present Cluster-PFN, a Transformer-based model that extends Prior-Data Fitted Networks (PFNs) to unsupervised Bayesian clustering. Trained entirely on synthetic datasets generated from a finite Gaussian Mixture Model (GMM) prior, Cluster-PFN learns to estimate the posterior distribution over both the number of clusters and the cluster assignments. Our method estimates the number of clusters more accurately than handcrafted model selection procedures such as AIC, BIC, and Variational Inference (VI), and achieves clustering quality competitive with VI while being orders of magnitude faster. Cluster-PFN can be trained on complex priors that include missing data, and it outperforms imputation-based baselines on real-world genomic datasets at high missingness. These results show that Cluster-PFN can provide scalable and flexible Bayesian clustering.
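To make the training setup concrete, the following is a minimal sketch of how one synthetic training dataset might be drawn from a finite GMM prior with missing values. It is not the authors' generator. The function name `sample_gmm_dataset`, the uniform prior over the number of components, the Dirichlet(1) weights, the isotropic covariances, the hyperparameter ranges, and the missing-completely-at-random masking are all assumptions chosen for illustration; the abstract does not specify them.

```python
import random


def sample_gmm_dataset(n_points, n_dims, max_clusters, missing_rate, seed=0):
    """Draw one synthetic dataset from a simple finite GMM prior (illustrative only)."""
    rng = random.Random(seed)

    # Assumed prior: number of mixture components uniform on 1..max_clusters.
    k = rng.randint(1, max_clusters)

    # Assumed prior: mixture weights ~ Dirichlet(1, ..., 1),
    # sampled by normalizing independent Gamma(1, 1) draws.
    raw = [rng.gammavariate(1.0, 1.0) for _ in range(k)]
    total = sum(raw)
    weights = [w / total for w in raw]

    # Assumed prior: component means ~ N(0, 3^2) per dimension,
    # one isotropic standard deviation per component ~ Uniform(0.3, 1.0).
    means = [[rng.gauss(0.0, 3.0) for _ in range(n_dims)] for _ in range(k)]
    stds = [rng.uniform(0.3, 1.0) for _ in range(k)]

    points, labels = [], []
    for _ in range(n_points):
        # Pick a component, then sample the point from that Gaussian.
        z = rng.choices(range(k), weights=weights)[0]
        x = [rng.gauss(means[z][d], stds[z]) for d in range(n_dims)]
        # Assumed missingness mechanism: each entry is dropped independently
        # with probability missing_rate; None marks a missing value.
        x = [None if rng.random() < missing_rate else v for v in x]
        points.append(x)
        labels.append(z)

    # k is the number of components in the prior draw; some components
    # may receive no points in a small sample.
    return points, labels, k


points, labels, k = sample_gmm_dataset(
    n_points=200, n_dims=2, max_clusters=5, missing_rate=0.2, seed=1
)
```

In a PFN-style setup, many such draws would serve as training examples, with the masked `points` as input and `labels` and `k` as prediction targets, so that the network learns to approximate the posterior implied by the prior. Because the missingness is part of the generator, no separate imputation step is needed at inference time.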