🤖 AI Summary
Sufficient dimension reduction (SDR) faces significant challenges in distributed high-dimensional big data settings due to massive sample sizes, high dimensionality, and strong node heterogeneity.
Method: This paper first proposes an exact distributed estimator for sliced inverse regression (SIR) and then a unified distributed estimation framework for conditional-moment-based SDR methods. The framework combines local conditional moment modeling with global consistency constraints, which naturally accommodates heterogeneous data structures, and employs a low-communication-cost distributed iterative optimization strategy to reduce both computational and communication overhead.
Contribution/Results: Theoretically, the global estimator is proven to be √n-consistent and asymptotically normal. Empirically, the method matches the accuracy of centralized SIR on both synthetic and real-world datasets, and it remains robust when the SDR method fails locally on some nodes or when heterogeneity increases.
📝 Abstract
Massive datasets are now typically dispersed across multiple locations and pose the dual challenges of high dimensionality and huge sample size. It is therefore necessary to explore sufficient dimension reduction (SDR) methods for distributed data. In this paper, we first propose an exact distributed estimator for sliced inverse regression that substantially improves computational efficiency while yielding the same estimate as the one computed on the full sample. We then propose a unified distributed framework for general conditional-moment-based inverse regression methods. This framework allows data at different locations to have distinct population structures, thus addressing the issue of heterogeneity. To assess the effectiveness of the proposed methods, we conduct simulations with various data generation mechanisms and examine scenarios in which samples are homogeneously and equally, heterogeneously and equally, or heterogeneously and unequally scattered across local nodes. Our findings highlight the versatility and applicability of the unified framework. The communication cost is acceptable in practice, and the computation cost is greatly reduced. A sensitivity analysis verifies the robustness of the algorithm under extreme conditions in which the SDR method fails locally on some nodes. A real data analysis also demonstrates the superior performance of the algorithm.
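The claim that a distributed SIR estimate can coincide with the full-sample one can be illustrated with a minimal sketch. It is not the paper's algorithm. It assumes that all nodes share pre-agreed slice boundaries (`edges`), in which case slice counts, slice sums, and the Gram matrix are additive across nodes. The helper names `local_stats` and `sir_from_stats`, the simulated linear model, and the three-node split are all hypothetical.

```python
import numpy as np

def local_stats(X, y, edges):
    """Sufficient statistics one node would send (hypothetical helper)."""
    H = len(edges) + 1                      # number of slices
    labels = np.digitize(y, edges)          # slice index in 0..H-1
    counts = np.bincount(labels, minlength=H).astype(float)
    slice_sums = np.zeros((H, X.shape[1]))
    np.add.at(slice_sums, labels, X)        # per-slice sums of the rows of X
    return counts, slice_sums, X.T @ X      # Gram matrix is additive too

def sir_from_stats(counts, slice_sums, gram, d):
    """SIR directions computed from aggregated statistics only."""
    n = counts.sum()
    mean = slice_sums.sum(axis=0) / n
    sigma = gram / n - np.outer(mean, mean)            # covariance of X
    keep = counts > 0                                  # skip empty slices
    diffs = slice_sums[keep] / counts[keep, None] - mean
    M = (diffs * (counts[keep, None] / n)).T @ diffs   # weighted cov of slice means
    # Solve M b = lambda * sigma b through a Cholesky whitening step.
    L_inv = np.linalg.inv(np.linalg.cholesky(sigma))
    _, vecs = np.linalg.eigh(L_inv @ M @ L_inv.T)      # ascending eigenvalues
    return L_inv.T @ vecs[:, ::-1][:, :d]              # top-d directions

# Simulated data under an assumed single-index linear model.
rng = np.random.default_rng(0)
n, p = 3000, 6
beta = np.zeros(p)
beta[:2] = 1 / np.sqrt(2)
X = rng.normal(size=(n, p))
y = X @ beta + 0.5 * rng.normal(size=n)
edges = np.array([-1.5, -0.5, 0.5, 1.5])    # assumed shared cut points

# Three nodes of unequal size; each sends only its summary statistics.
parts = np.split(np.arange(n), [500, 1500])
node_stats = [local_stats(X[idx], y[idx], edges) for idx in parts]
agg = [sum(s) for s in zip(*node_stats)]    # element-wise sums over nodes

B_dist = sir_from_stats(*agg, d=1)                        # distributed estimate
B_pool = sir_from_stats(*local_stats(X, y, edges), d=1)   # pooled estimate
```

Under these assumptions the aggregated statistics equal the pooled ones up to floating-point rounding, so `B_dist` and `B_pool` span the same direction. Each node transmits on the order of p² + Hp numbers regardless of its sample size. Slicing by global quantiles of the response, or allowing heterogeneous populations across nodes, would need additional steps that this sketch does not cover.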