NApy: Efficient Statistics in Python for Large-Scale Heterogeneous Data with Enhanced Support for Missing Data

📅 2025-05-01

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

To enable statistical testing on large-scale heterogeneous biomedical data with missing values under tight runtime and memory constraints, this paper presents NApy, a missingness-aware statistical computing package for Python. NApy combines a Numba backend (just-in-time compilation) with C++ kernels parallelized via OpenMP, and is designed for mixed-type data containing missing values. Compared to mainstream Python libraries (e.g., SciPy, statsmodels) and naive Python-based parallel implementations, it reduces both runtime and memory consumption by orders of magnitude for standard hypothesis tests, making on-the-fly analysis in interactive applications feasible. The package is open source and supports rapid, scalable exploratory statistical analysis in biomedical research.

📝 Abstract

Existing Python libraries and tools cannot efficiently compute statistical test results for large datasets that contain missing values. This becomes a problem as soon as runtime and memory constraints are essential considerations for a particular use case. Research areas where such limitations arise include interactive tools and databases for exploratory analysis of biomedical data. To address this problem, we present the Python package NApy, which relies on a Numba and C++ backend with OpenMP parallelization to enable scalable statistical testing for mixed-type datasets in the presence of missing values. With respect to both runtime and memory consumption, NApy outperforms competitor tools and baseline implementations with naive Python-based parallelization by orders of magnitude, thereby enabling on-the-fly analyses in interactive applications. NApy is publicly available at https://github.com/DyHealthNet/NApy.
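To make the problem concrete, the following is a minimal sketch of a missingness-aware test statistic: Pearson correlation between every pair of variables, where each pair uses only the samples observed in both. It is plain NumPy and deliberately naive. It does not use NApy and is not NApy's implementation; the function name `pairwise_pearson_nan` and the toy data are hypothetical and chosen only for illustration.

```python
import numpy as np

def pairwise_pearson_nan(data):
    """Pearson correlation for every pair of rows (variables).

    Missing values are encoded as NaN. For each pair, only the columns
    (samples) observed in BOTH rows are used (pairwise deletion).
    """
    n_vars = data.shape[0]
    corr = np.full((n_vars, n_vars), np.nan)
    for i in range(n_vars):
        for j in range(i, n_vars):
            # Samples where both variables are observed.
            valid = ~np.isnan(data[i]) & ~np.isnan(data[j])
            if valid.sum() < 2:
                continue  # not enough shared samples; leave NaN
            x = data[i, valid]
            y = data[j, valid]
            xc = x - x.mean()
            yc = y - y.mean()
            denom = np.sqrt((xc @ xc) * (yc @ yc))
            if denom > 0:
                corr[i, j] = corr[j, i] = (xc @ yc) / denom
    return corr

# Toy data: 3 variables (rows) x 5 samples (columns), NaN marks missing.
data = np.array([
    [1.0, 2.0, 3.0, 4.0, np.nan],
    [2.0, 4.0, np.nan, 8.0, 10.0],
    [4.0, 3.0, 2.0, np.nan, 0.0],
])
corr = pairwise_pearson_nan(data)
print(corr)
```

Because the valid-sample mask differs for every pair, the double loop runs in the Python interpreter and scales quadratically with the number of variables, which illustrates why such baselines become a bottleneck on large datasets.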

Problem

Research questions and friction points this paper is trying to address.

Efficient statistical computation for large heterogeneous datasets

Handling missing data in statistical tests with Python

Improving runtime and memory usage for biomedical data analysis

Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses a Numba and C++ backend for efficiency

Implements OpenMP parallelization for scalability

Optimizes runtime and memory for large datasets
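As a rough illustration of how missingness-aware statistics can avoid a per-pair interpreter loop, the sketch below computes pairwise-complete Pearson correlations with a handful of matrix products over an observation mask. This is a generic NumPy formulation offered under stated assumptions; it is not NApy's Numba or C++ kernel, and the function name `pearson_pairwise_complete` is hypothetical.

```python
import numpy as np

def pearson_pairwise_complete(data):
    """Pairwise-complete Pearson correlation between rows via matrix products.

    NaN marks a missing value. All per-pair sums are restricted to samples
    observed in both rows by multiplying with the 0/1 observation mask.
    """
    mask = (~np.isnan(data)).astype(float)        # 1.0 where observed
    filled = np.where(np.isnan(data), 0.0, data)  # NaN -> 0 so it drops out of sums

    n = mask @ mask.T                 # shared sample count per pair
    sum_x = filled @ mask.T           # sum of row i over samples shared with row j
    sum_y = sum_x.T                   # sum of row j over samples shared with row i
    sum_xy = filled @ filled.T        # sum of products over shared samples
    sum_xx = (filled ** 2) @ mask.T   # sum of squares of row i over shared samples
    sum_yy = sum_xx.T

    with np.errstate(divide="ignore", invalid="ignore"):
        cov = sum_xy - sum_x * sum_y / n
        var_x = sum_xx - sum_x ** 2 / n
        var_y = sum_yy - sum_y ** 2 / n
        corr = cov / np.sqrt(var_x * var_y)

    corr[n < 2] = np.nan  # undefined with fewer than two shared samples
    return corr

# Toy data: 3 variables (rows) x 5 samples (columns), NaN marks missing.
data = np.array([
    [1.0, 2.0, 3.0, 4.0, np.nan],
    [2.0, 4.0, np.nan, 8.0, 10.0],
    [4.0, 3.0, 2.0, np.nan, 0.0],
])
corr = pearson_pairwise_complete(data)
print(corr)
```

The sum-of-squares form trades some numerical stability for speed, and it materializes several variables-by-variables matrices, so memory grows quadratically with the number of variables. Compiled, multithreaded kernels are one way to control both costs.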

🔎 Similar Papers

RandALO: Out-of-sample risk estimation in no time flat