A distribution-guided Mapper algorithm

📅 2024-01-19
🏛️ arXiv.org
📈 Citations: 1
Influential: 0

🤖 AI Summary
The classical Mapper algorithm uses a fixed cover, so it can miss local topological detail in complex data such as SARS-CoV-2 RNA sequences. D-Mapper is a distribution-guided variant: it uses a probabilistic model and the intrinsic characteristics of the data to build density-guided covers, so the cover adapts to heterogeneous structure. The authors also introduce an evaluation metric that combines the quality of overlapping clustering with extended persistent homology. In their numerical experiments, D-Mapper outperforms the classical Mapper algorithm in various scenarios. Applied to a SARS-CoV-2 RNA sequence dataset, it reveals both vertical and horizontal evolution processes among virus variants.
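To make the contrast between a fixed cover and a density-adaptive one concrete, here is a minimal sketch in Python. It is not the D-Mapper implementation: the function names `uniform_cover` and `quantile_cover` are hypothetical, and the adaptive cover uses empirical quantiles as a simple stand-in for the probabilistic model described in the paper.

```python
import numpy as np

def uniform_cover(values, n_intervals, overlap):
    """Classic-style cover: equal-length intervals with a fixed overlap ratio."""
    lo, hi = float(np.min(values)), float(np.max(values))
    # Length chosen so that n_intervals overlapping intervals span [lo, hi].
    length = (hi - lo) / (n_intervals - (n_intervals - 1) * overlap)
    step = length * (1.0 - overlap)
    return [(lo + i * step, lo + i * step + length) for i in range(n_intervals)]

def quantile_cover(values, n_intervals, overlap):
    """Density-adaptive stand-in (an assumption, not the paper's method):
    cut points sit at equally spaced quantiles, so intervals shrink where
    the filter values are dense and stretch where they are sparse."""
    cuts = np.quantile(values, np.linspace(0.0, 1.0, n_intervals + 1))
    cover = []
    for a, b in zip(cuts[:-1], cuts[1:]):
        pad = 0.5 * overlap * (b - a)  # widen both sides so neighbours overlap
        cover.append((float(a - pad), float(b + pad)))
    return cover

# Hypothetical filter values: one dense cluster plus a sparse tail.
rng = np.random.default_rng(0)
values = np.concatenate([rng.normal(0.0, 0.1, 900), rng.uniform(2.0, 10.0, 100)])

uniform = uniform_cover(values, n_intervals=5, overlap=0.3)
adaptive = quantile_cover(values, n_intervals=5, overlap=0.3)

for name, cover in (("uniform", uniform), ("adaptive", adaptive)):
    widths = [round(b - a, 3) for a, b in cover]
    print(name, "interval widths:", widths)
```

With the uniform cover, the whole dense cluster falls into the first interval; the quantile cover spends several narrow intervals on it. That is the kind of local detail a fixed cover can blur.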

📝 Abstract
Motivation: The Mapper algorithm is an essential tool for exploring the shape of data in topological data analysis. Given a dataset as input, the Mapper algorithm outputs a graph representing the topological features of the whole dataset. This graph is often regarded as an approximation of a Reeb graph of the data. The classic Mapper algorithm uses fixed interval lengths and overlapping ratios, which might fail to reveal subtle features of the data, especially when the underlying structure is complex.

Results: In this work, we introduce a distribution-guided Mapper algorithm named D-Mapper, which uses the properties of a probability model and the intrinsic characteristics of the data to generate density-guided covers and provide enhanced topological features. Our proposed algorithm is a probabilistic model-based approach, which could serve as an alternative to non-probabilistic ones. Moreover, we introduce a metric that accounts for both the quality of overlapping clustering and extended persistent homology to measure the performance of Mapper-type algorithms. Our numerical experiments indicate that D-Mapper outperforms the classical Mapper algorithm in various scenarios. We also apply D-Mapper to a SARS-CoV-2 coronavirus RNA sequence dataset to explore the topological structure of different virus variants. The results indicate that the D-Mapper algorithm can reveal both vertical and horizontal evolution processes of the viruses.

Availability: Our package is available at https://github.com/ShufeiGe/D-Mapper.
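The pipeline the abstract describes (dataset in, graph out) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not code from the D-Mapper package: the helper names `connected_components` and `mapper_graph`, the hand-picked cover, the choice of the x-coordinate as filter, and the distance threshold `eps` are all hypothetical. Clustering is done by linking points closer than `eps`, a simple substitute for whatever clustering method a real Mapper implementation would use.

```python
import numpy as np

def connected_components(points, eps):
    """Label points so that any two points within distance eps share a cluster."""
    n = len(points)
    labels = -np.ones(n, dtype=int)
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    current = 0
    for start in range(n):
        if labels[start] != -1:
            continue
        labels[start] = current
        stack = [start]
        while stack:  # flood fill over the eps-neighbourhood graph
            i = stack.pop()
            for j in np.flatnonzero((dists[i] <= eps) & (labels == -1)):
                labels[j] = current
                stack.append(j)
        current += 1
    return labels

def mapper_graph(points, filter_values, cover, eps):
    """Nodes are clusters inside each cover interval's preimage;
    an edge joins two nodes that share at least one data point."""
    nodes = []
    for a, b in cover:
        idx = np.flatnonzero((filter_values >= a) & (filter_values <= b))
        if idx.size == 0:
            continue
        labels = connected_components(points[idx], eps)
        for lab in np.unique(labels):
            nodes.append(set(idx[labels == lab].tolist()))
    edges = {(i, j)
             for i in range(len(nodes))
             for j in range(i + 1, len(nodes))
             if nodes[i] & nodes[j]}
    return nodes, edges

# Toy dataset: 200 evenly spaced points on the unit circle.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
points = np.column_stack([np.cos(theta), np.sin(theta)])
filter_values = points[:, 0]  # filter = x-coordinate (an arbitrary choice)

# Fixed, hand-picked overlapping intervals covering [-1, 1].
cover = [(-1.0, -0.3), (-0.5, 0.1), (-0.1, 0.5), (0.3, 1.0)]
nodes, edges = mapper_graph(points, filter_values, cover, eps=0.1)
print("nodes:", len(nodes), "edges:", len(edges))
```

For a circle, the resulting graph should contain a single loop, mirroring the Reeb graph of the x-coordinate on a circle. Replacing the fixed `cover` with a data-driven one is the step that D-Mapper targets.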
Problem

Research questions and friction points this paper is trying to address.

Complex Data Analysis
Topological Data Analysis
SARS-CoV-2 RNA Sequences
Innovation

Methods, ideas, or system contributions that make the work stand out.

D-Mapper algorithm
probabilistic modeling
bioinformatics application