🤖 AI Summary
This study addresses the problem of recovering unknown nonparametric component densities from multiple groups of mixed observational samples, a continuous analogue of topic modeling sometimes described as density deconvolution. To this end, the authors propose a modified kernel density estimator that treats per-group histogram vectors as "documents," applies topic modeling to them to obtain group-specific mixing weights, and then removes bias via a U-statistic correction. This approach represents the first extension of discrete topic models to this continuous unmixing setting. Assuming each component density lies in a Nikol'skii smoothness class, the authors show that the sum of integrated squared errors of the proposed estimator attains an information-theoretic lower bound, thereby establishing its minimax rate optimality under the given conditions.
📝 Abstract
Motivated by applications in statistics and machine learning, we consider the problem of unmixing convex combinations of nonparametric densities. Suppose we observe $n$ groups of samples, where the $i$th group consists of $N_i$ independent samples from a $d$-variate density $f_i(x)=\sum_{k=1}^K \pi_i(k)g_k(x)$. Here, each $g_k(x)$ is a nonparametric density, and each $\pi_i$ is a $K$-dimensional mixed membership vector. We aim to estimate $g_1(x), \ldots, g_K(x)$. This problem generalizes topic modeling from discrete to continuous variables and has applications to LLMs via word embeddings.
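The data-generating model above can be simulated directly. The sketch below uses a hypothetical one-dimensional instance with $K=2$ Gaussian components standing in for the nonparametric $g_k$; the component choices ($K$, $n$, $N$, the Gaussian means) are illustrative assumptions, not values from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 1-D instance of the model: K=2 components g_k, n=5 groups,
# N=1000 samples per group. The paper allows general d-variate nonparametric
# g_k; Gaussians are used here only for illustration.
K, n, N = 2, 5, 1000
means, sds = np.array([-2.0, 2.0]), np.array([1.0, 1.0])

# Mixed membership vectors pi_i, one point on the K-simplex per group.
Pi = rng.dirichlet(alpha=np.ones(K), size=n)          # shape (n, K)

def sample_group(pi, N):
    """Draw N i.i.d. samples from f_i(x) = sum_k pi(k) g_k(x)."""
    comps = rng.choice(K, size=N, p=pi)               # latent component labels
    return rng.normal(means[comps], sds[comps])

groups = [sample_group(Pi[i], N) for i in range(n)]   # n arrays of length N
```

Each group's samples follow its own mixture $f_i$, while the component densities $g_k$ are shared across groups; the estimation task is to recover the $g_k$ from the $n$ groups jointly.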
In this paper, we propose an estimator for the above problem, which modifies the classical kernel density estimator by assigning group-specific weights that are computed by topic modeling on histogram vectors and debiased via U-statistics. For any $\beta>0$, assuming that each $g_k(x)$ belongs to the Nikol'skii class with smoothness parameter $\beta$, we show that the sum of integrated squared errors of the constructed estimators has a convergence rate that depends on $n$, $K$, $d$, and the per-group sample size $N$. We also provide a matching lower bound, which suggests that our estimator is rate-optimal.
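One plausible reading of the weighted-KDE idea can be sketched as follows, under strong simplifications: the mixing matrix $\Pi$ is taken as known (in the paper it is estimated by topic modeling on histogram vectors) and the U-statistic debiasing is omitted. The unmixing step here, a least-squares inversion of $\Pi$ applied to per-group KDEs, is an illustrative stand-in, not the paper's exact construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Simplified setup: K=2 Gaussian components, n=5 groups, known mixing matrix Pi.
K, n, N, h = 2, 5, 2000, 0.3
means = np.array([-2.0, 2.0])
Pi = rng.dirichlet(np.ones(K), size=n)                      # (n, K) mixing weights
groups = [rng.normal(means[rng.choice(K, N, p=Pi[i])], 1.0) for i in range(n)]

def group_kde(x, samples, h):
    """Ordinary Gaussian KDE of one group's mixture density f_i at points x."""
    z = (x[:, None] - samples[None, :]) / h
    return np.exp(-0.5 * z**2).sum(axis=1) / (len(samples) * h * np.sqrt(2 * np.pi))

x = np.linspace(-6.0, 6.0, 400)
F_hat = np.stack([group_kde(x, g, h) for g in groups])      # (n, 400): estimates of f_i

# Unmix: since f_i = sum_k Pi[i, k] g_k, solve the linear system in least squares.
# Rows of pinv(Pi) play the role of group-specific weights applied to the
# per-group kernel sums, yielding component-density estimates g_hat_k.
G_hat = np.linalg.pinv(Pi) @ F_hat                          # (K, 400)
```

Because the rows of $\Pi$ sum to one, each recovered $\hat g_k$ still integrates to approximately one, and its mode sits near the corresponding component mean.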