Alice and the Caterpillar: A more descriptive null model for assessing data mining results

📅 2022-11-01

🏛️ Industrial Conference on Data Mining

📈 Citations: 3

✨ Influential: 0

🤖 AI Summary

This paper addresses the limited expressiveness of existing null models for binary transactional and sequence datasets, which do not preserve joint-degree structure. We propose novel null models that preserve the Bipartite Joint Degree Matrix of the bipartite (multi-)graph corresponding to the dataset; this in turn fixes the number of “caterpillars” (paths of length three), in addition to the properties preserved by earlier models. To sample from these null models we describe Alice, a suite of Markov chain Monte Carlo (MCMC) algorithms built on a carefully defined set of states and efficient operations for moving between them. In the experimental evaluation, Alice mixes fast and scales well, and the new null model marks different results as significant than the null models previously considered in the literature.

📝 Abstract

We introduce novel null models for assessing the results obtained from observed binary transactional and sequence datasets, using statistical hypothesis testing. Our null models maintain more properties of the observed dataset than existing ones. Specifically, they preserve the Bipartite Joint Degree Matrix of the bipartite (multi-)graph corresponding to the dataset, which ensures that the number of caterpillars, i.e., paths of length three, is preserved, in addition to other properties considered by other models. We describe Alice, a suite of Markov chain Monte Carlo algorithms for sampling datasets from our null models, based on a carefully defined set of states and efficient operations to move between them. The results of our experimental evaluation show that Alice mixes fast and scales well, and that our null model finds different significant results than ones previously considered in the literature.
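
To make the preserved quantity concrete, here is a minimal Python sketch of how a Bipartite Joint Degree Matrix can be computed for a small binary transactional dataset, and why it determines the caterpillar count. The function names `bjdm` and `caterpillars` and the toy dataset are illustrative assumptions, not taken from the paper. Each (transaction, item) pair is treated as an edge, and entry (i, j) counts the edges that join a transaction of length i to an item of support j. An edge between nodes of degrees i and j is the middle edge of (i - 1)(j - 1) paths of length three, so summing over the matrix gives the number of caterpillars.

```python
from collections import Counter


def bjdm(transactions):
    """Sparse Bipartite Joint Degree Matrix of a binary transactional dataset.

    Returns a Counter mapping (transaction_length, item_support) to the
    number of (transaction, item) edges with that pair of degrees.
    """
    itemsets = [set(t) for t in transactions]
    # Support of an item = number of transactions containing it.
    support = Counter(item for items in itemsets for item in items)
    matrix = Counter()
    for items in itemsets:
        for item in items:
            matrix[(len(items), support[item])] += 1
    return matrix


def caterpillars(matrix):
    """Number of paths of length three, computed from the BJDM alone.

    Each edge between nodes of degrees i and j is the middle edge of
    (i - 1) * (j - 1) such paths; in a bipartite graph the two end
    nodes lie on opposite sides, so they are always distinct.
    """
    return sum(n * (i - 1) * (j - 1) for (i, j), n in matrix.items())


# Hypothetical toy dataset: three transactions over items a, b, c.
dataset = [{"a", "b"}, {"a", "c"}, {"a", "b", "c"}]
matrix = bjdm(dataset)
num_caterpillars = caterpillars(matrix)
```

Two datasets with the same matrix therefore have the same transaction lengths, the same item supports, and the same number of caterpillars, which is the sense in which the null model keeps more structure than one that fixes only the two degree sequences.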

Problem

Research questions and friction points this paper is trying to address.

Develops null models for binary transactional datasets

Preserves Bipartite Joint Degree Matrix properties

Introduces Alice for efficient dataset sampling

Innovation

Methods, ideas, or system contributions that make the work stand out.

Novel null models for binary data

Preserves Bipartite Joint Degree Matrix

Alice: MCMC algorithms for sampling
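
As an illustration of what moving between states that share a Bipartite Joint Degree Matrix can look like, the following is a simplified Python sketch of a restricted edge swap. It is an assumption made for illustration and should not be read as the paper's actual Alice algorithms: the names `run_chain` and `bjdm_of` and the toy dataset are hypothetical. Two edges (a, b) and (c, d) are replaced by (a, d) and (c, b) only when the two transactions have equal length or the two items have equal support, and only when no duplicate edge would result. Under that condition every node keeps its degree and every matrix entry keeps its count.

```python
import random
from collections import Counter


def degrees(edges):
    """Degree of every transaction (left node) and item (right node)."""
    left = Counter(t for t, _ in edges)
    right = Counter(x for _, x in edges)
    return left, right


def bjdm_of(edges):
    """Sparse BJDM: (transaction degree, item degree) -> edge count."""
    left, right = degrees(edges)
    return Counter((left[t], right[x]) for t, x in edges)


def run_chain(edges, steps, seed=0):
    """Apply `steps` restricted-swap proposals and return the new edge list."""
    rng = random.Random(seed)
    edges = list(edges)
    present = set(edges)
    left, right = degrees(edges)  # swaps never change any degree
    for _ in range(steps):
        i, j = rng.sample(range(len(edges)), 2)
        (a, b), (c, d) = edges[i], edges[j]
        same_degree = left[a] == left[c] or right[b] == right[d]
        # Also rejects a == c and b == d, where the swap is a no-op.
        creates_duplicate = (a, d) in present or (c, b) in present
        if not same_degree or creates_duplicate:
            continue  # rejected proposal: stay in the current state
        present -= {(a, b), (c, d)}
        present |= {(a, d), (c, b)}
        edges[i], edges[j] = (a, d), (c, b)
    return edges


# Hypothetical toy dataset as (transaction id, item) edges.
observed = [(0, "a"), (0, "b"), (1, "a"), (1, "c"),
            (2, "a"), (2, "b"), (2, "c")]
sampled = run_chain(observed, steps=1000)
```

The sketch covers only the binary (simple graph) case and makes no claim about mixing time; the paper's states and operations are defined more carefully than this.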
