🤖 AI Summary
Identifying latent discrete structures in multivariate distributions is challenging due to their unknown existence, form, and lack of prior assumptions.
Method: We propose a model-free, multiscale nonparametric maximum likelihood estimation framework that jointly models the joint density across scales and introduces a novel prior-free existence test for latent structures. Without presupposing the presence or parametric form of latent structure, it adaptively discovers and integrates discrete patterns—from coarse- to fine-grained—through data-driven density estimation.
Contribution/Results: Theoretically, we establish the asymptotic distribution of the estimator. Methodologically, the framework unifies interpretable discrete representation learning with data-driven model selection. Experiments on diverse complex distributions demonstrate substantial improvements in clustering interpretability and latent structure recovery accuracy, validating its capacity to effectively capture underlying data-generating mechanisms.
📝 Abstract
Multivariate distributions often carry latent structures that are difficult to identify and estimate, and which better reflect the data generating mechanism than extrinsic structures exhibited simply by the raw data. In this paper, we propose a model-free approach for estimating such latent structures whenever they are present, without assuming they exist a priori. Given an arbitrary density $p_0$, we construct a multiscale representation of the density and propose data-driven methods for selecting representative models that capture meaningful discrete structure. Our approach uses a nonparametric maximum likelihood estimator to estimate the latent structure at different scales and we further characterize their asymptotic limits. By carrying out such a multiscale analysis, we obtain coarseto-fine structures inherent in the original distribution, which are integrated via a model selection procedure to yield an interpretable discrete representation of it. As an application, we design a clustering algorithm based on the proposed procedure and demonstrate its effectiveness in capturing a wide range of latent structures.