🤖 AI Summary
Functional data pose significant statistical modeling challenges due to privacy constraints, sparse and irregular sampling, infinite dimensionality, and non-Gaussian structures. To address these challenges, we propose a novel semiparametric vine flow generative model that abandons restrictive Gaussianity and low-rank assumptions. By integrating flow matching with vine copula structures, our method explicitly captures functional smoothness, enabling efficient modeling of irregularly sampled observations and high-fidelity generation of infinite-dimensional functions. Experiments on synthetic benchmarks and real-world MIMIC-IV clinical trajectory data demonstrate that the method is more computationally efficient than state-of-the-art methods and that its synthesized data achieve higher fidelity and downstream statistical utility, including in regression and hypothesis testing. The framework provides a trustworthy, privacy-preserving generative solution for functional data analysis in sensitive domains.
📝 Abstract
Functional data, i.e., smooth random functions observed over a continuous domain, are increasingly available in areas such as biomedical research, health informatics, and epidemiology. However, effective statistical analysis of functional data is often hindered by challenges such as privacy constraints, sparse and irregular sampling, infinite dimensionality, and non-Gaussian structures. To address these challenges, we introduce a novel framework named Smooth Flow Matching (SFM), tailored for generative modeling of functional data to enable statistical analysis without exposing sensitive real data. Built upon flow-matching ideas, SFM constructs a semiparametric copula flow to generate infinite-dimensional functional data, free from Gaussianity or low-rank assumptions. It is computationally efficient, handles irregular observations, and guarantees the smoothness of the generated functions, offering a practical and flexible solution in scenarios where existing deep generative methods are not applicable. Through extensive simulation studies, we demonstrate the advantages of SFM in terms of both synthetic data quality and computational efficiency. We then apply SFM to generate clinical trajectory data from MIMIC-IV, a longitudinal database of patient electronic health records (EHR). Our analysis showcases the ability of SFM to produce high-quality surrogate data for downstream statistical tasks, highlighting its potential to boost the utility of EHR data for clinical applications.