A Sampling-based Framework for Hypothesis Testing on Large Attributed Graphs

📅 2024-03-20

🏛️ Proceedings of the VLDB Endowment

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This paper addresses hypothesis testing at the node, edge, and path levels in large-scale attributed graphs, proposing the first formal multi-granularity graph hypothesis testing framework. Because conventional hypothesis testing methods assume tabular data, the paper introduces PHASE, a path-hypothesis-aware random walk sampler, and an optimized variant, PHASE-opt, that improves its time efficiency. The approach combines graph sampling theory, *m*-dimensional random walk modeling, and time-complexity analysis to preserve statistical validity while improving scalability. Experiments on three real-world attributed graph datasets show that, compared with generic sampling baselines, the framework improves testing accuracy by 12.7% and accelerates runtime by 3.8×.

📝 Abstract

Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing on graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses on attributed graphs. We develop a sampling-based hypothesis testing framework, which can accommodate existing hypothesis-agnostic graph sampling methods. To achieve accurate and time-efficient sampling, we then propose a Path-Hypothesis-Aware SamplEr, PHASE, an m-dimensional random walk that accounts for the paths specified in the hypothesis. We further optimize its time efficiency and propose PHASE-opt. Experiments on three real datasets demonstrate the ability of our framework to leverage common graph sampling methods for hypothesis testing, and the superiority of hypothesis-aware sampling methods in terms of accuracy and time efficiency.
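
To make the idea of sampling-based testing concrete, the following is a minimal sketch of a node hypothesis test driven by a hypothesis-agnostic sampler. It is not PHASE or PHASE-opt, whose details are not given here; it uses a plain simple random walk and a one-sample z-test. The toy graph, the `age` attribute, and the function names `random_walk_sample` and `one_sample_z_test` are all assumptions made for illustration. The toy graph is regular (every node has degree 4), so the walk does not over-sample high-degree nodes; on a real graph the visited nodes would generally need reweighting, and consecutive walk samples are correlated, which the z-test ignores.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def random_walk_sample(adj, start, n_steps, rng):
    """Return the nodes visited by a simple random walk (hypothesis-agnostic)."""
    visited = [start]
    current = start
    for _ in range(n_steps):
        neighbors = adj[current]
        # Restart from the start node if the walk hits a dead end.
        current = rng.choice(neighbors) if neighbors else start
        visited.append(current)
    return visited


def one_sample_z_test(values, mu0):
    """Two-sided z-test of H0: mean == mu0, using a normal approximation."""
    se = stdev(values) / math.sqrt(len(values))
    z = (mean(values) - mu0) / se
    p = 2.0 * (1.0 - NormalDist().cdf(abs(z)))
    return z, p


# Hypothetical toy attributed graph: a ring of 200 nodes with extra chords.
rng = random.Random(0)
n = 200
adj = {v: [(v - 1) % n, (v + 1) % n, (v + 7) % n, (v - 7) % n] for v in range(n)}
age = {v: 30 + (v % 10) for v in range(n)}  # node attribute, values 30..39

# Sample nodes by walking the graph, then test a node-level hypothesis
# on the sampled attribute values instead of on the full graph.
sample = random_walk_sample(adj, start=0, n_steps=500, rng=rng)
sample_ages = [age[v] for v in sample]

z_far, p_far = one_sample_z_test(sample_ages, mu0=50.0)    # H0: mean age is 50
z_near, p_near = one_sample_z_test(sample_ages, mu0=34.5)  # H0: mean age is 34.5
print(f"H0 mean=50.0: z={z_far:.2f}, p={p_far:.3g}")
print(f"H0 mean=34.5: z={z_near:.2f}, p={p_near:.3g}")
```

The sampler and the test are deliberately separate functions: in a framework of this kind, the test operates only on sampled values, so a different sampler (hypothesis-agnostic or hypothesis-aware) could be swapped in without changing the testing step.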

Problem

Research questions and friction points this paper is trying to address.

Formalize node, edge, path hypotheses

Develop sampling-based hypothesis testing framework

Propose Path-Hypothesis-Aware SamplEr (PHASE)

Innovation

Methods, ideas, or system contributions that make the work stand out.

Sampling-based hypothesis testing framework

Path-Hypothesis-Aware SamplEr (PHASE)

Optimized PHASE for time efficiency

🔎 Similar Papers

Graph sub-sampling for divide-and-conquer algorithms in large networks