🤖 AI Summary
Knowledge graphs built from high-dimensional, sparse electronic health records (EHRs) usually come without statistical guarantees on their edges, and privacy constraints often limit access to patient-level data. This paper proposes what the authors describe as the first inferential framework for recovering the edges of a sparse knowledge graph with such guarantees. The method is built on a dynamic log-linear topic model: embeddings are estimated by singular value decomposition of the empirical pointwise mutual information matrix, and entrywise asymptotic normality of the resulting low-rank estimator enables edge significance testing with controlled Type-I error. The work addresses a gap in the theory of statistical inference for nonlinear statistics under low-rank, temporally dependent models. Simulation studies validate the approach; on real-world EHR data, the method is used to construct clinical knowledge graphs and to generate clinical feature embeddings.
📝 Abstract
The effective analysis of high-dimensional Electronic Health Record (EHR) data, which has substantial potential for healthcare research, presents notable methodological challenges. Predictive modeling guided by a knowledge graph (KG), which enables efficient feature selection, can enhance both statistical efficiency and interpretability. While various methods have emerged for constructing KGs, existing techniques often lack statistical certainty about the presence of links between entities, especially when the use of patient-level EHR data is limited by privacy concerns. In this paper, we propose the first inferential framework for deriving a sparse KG with statistical guarantees, based on the dynamic log-linear topic model proposed by Arora et al. (2016). Within this model, the KG embeddings are estimated by performing singular value decomposition on the empirical pointwise mutual information matrix, offering a scalable solution. We then establish entrywise asymptotic normality for the low-rank KG estimator, enabling the recovery of sparse graph edges with controlled type I error. Our work addresses the under-explored problem of statistical inference for non-linear statistics under low-rank, temporally dependent models, a critical gap in existing research. We validate our approach through extensive simulation studies and then apply the method to real-world EHR data to construct clinical KGs and generate clinical feature embeddings.
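The embedding step described in the abstract (singular value decomposition of an empirical pointwise mutual information matrix) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `pmi_matrix` and `svd_embeddings`, the `eps` smoothing term, the `U * sqrt(S)` scaling convention, and the toy co-occurrence counts are all assumptions made for illustration. The inference step (entrywise asymptotic normality and edge testing) is not shown.

```python
import numpy as np


def pmi_matrix(cooc, eps=1e-12):
    """Empirical PMI from a symmetric co-occurrence count matrix.

    `eps` is a hypothetical smoothing constant that avoids log(0)
    for code pairs that never co-occur.
    """
    cooc = np.asarray(cooc, dtype=float)
    p_joint = cooc / cooc.sum()          # joint frequencies p(i, j)
    p_marg = p_joint.sum(axis=1)         # marginal frequencies p(i)
    return np.log((p_joint + eps) / (np.outer(p_marg, p_marg) + eps))


def svd_embeddings(pmi, rank):
    """Return an (n, rank) embedding matrix U_r * sqrt(S_r).

    The sqrt scaling is one common convention, assumed here.
    """
    u, s, _ = np.linalg.svd(pmi)
    return u[:, :rank] * np.sqrt(s[:rank])


# Toy co-occurrence counts for four hypothetical clinical codes.
counts = np.array([
    [50, 30,  2,  1],
    [30, 40,  3,  2],
    [ 2,  3, 60, 25],
    [ 1,  2, 25, 45],
])
embeddings = svd_embeddings(pmi_matrix(counts), rank=2)
print(embeddings.shape)
```

Because only aggregated co-occurrence counts enter the computation, a sketch like this needs no patient-level records, which is the setting the abstract highlights.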