A Sampling-based Framework for Hypothesis Testing on Large Attributed Graphs

2024-03-20
Proceedings of the VLDB Endowment
Citations: 0
Influential: 0
AI Summary
This paper addresses hypothesis testing at the node, edge, and path levels in large-scale attributed graphs and proposes the first formal multi-granularity graph hypothesis testing framework. To overcome the limitations of conventional tabular methods on graph-structured data, the authors design PHASE, a path-hypothesis-aware random walk sampler, and its optimized variant PHASE-opt, which together target both hypothesis-driven sampling and computational efficiency. The approach combines graph sampling theory, m-dimensional random walk modeling, and time-complexity analysis to preserve statistical validity while improving scalability. Experiments on three real-world attributed graph datasets show that, compared to generic sampling baselines, the framework improves testing accuracy by 12.7% and speeds up runtime by 3.8×, strengthening statistical inference on large-scale attributed graphs.

Abstract
Hypothesis testing is a statistical method used to draw conclusions about populations from sample data, typically represented in tables. With the prevalence of graph representations in real-life applications, hypothesis testing on graphs is gaining importance. In this work, we formalize node, edge, and path hypotheses on attributed graphs. We develop a sampling-based hypothesis testing framework, which can accommodate existing hypothesis-agnostic graph sampling methods. To achieve accurate and time-efficient sampling, we then propose a Path-Hypothesis-Aware SamplEr, PHASE, an m-dimensional random walk that accounts for the paths specified in the hypothesis. We further optimize its time efficiency and propose PHASE-opt. Experiments on three real datasets demonstrate the ability of our framework to leverage common graph sampling methods for hypothesis testing, and the superiority of hypothesis-aware sampling methods in terms of accuracy and time efficiency.
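The sample-then-test idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' framework or PHASE: it assumes a plain random walk as the hypothesis-agnostic sampler and a one-sided z-test on a node attribute as the test. The function names (`random_walk_sample`, `one_sided_z_test`), the toy graph, and the `rating` attribute are all hypothetical and made up for illustration.

```python
import math
import random
from statistics import mean, stdev


def random_walk_sample(adj, start, n_nodes, rng):
    """Collect up to n_nodes distinct nodes with a simple random walk.

    This is a generic, hypothesis-agnostic sampler, not PHASE.
    """
    sampled = [start]
    seen = {start}
    current = start
    for _ in range(100 * n_nodes):  # step budget so the loop always ends
        if len(sampled) >= n_nodes:
            break
        neighbors = adj[current]
        # Restart from an already sampled node when the walk hits a dead end.
        current = rng.choice(neighbors) if neighbors else rng.choice(sampled)
        if current not in seen:
            seen.add(current)
            sampled.append(current)
    return sampled


def one_sided_z_test(values, mu0):
    """Test H0: mean <= mu0 against H1: mean > mu0 (normal approximation)."""
    n = len(values)
    standard_error = stdev(values) / math.sqrt(n)
    z = (mean(values) - mu0) / standard_error
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper-tail P(Z > z)
    return z, p_value


# Toy attributed graph (made up): 200 nodes on a ring with extra chords,
# each carrying a numeric "rating" attribute between 3.5 and 4.5.
n = 200
adj = {i: [(i - 1) % n, (i + 1) % n, (i - 7) % n, (i + 7) % n] for i in range(n)}
rating = {i: 3.5 + 0.25 * (i % 5) for i in range(n)}

rng = random.Random(0)
sample = random_walk_sample(adj, start=0, n_nodes=50, rng=rng)
z, p_value = one_sided_z_test([rating[v] for v in sample], mu0=3.0)
print(f"sampled {len(sample)} nodes, z = {z:.2f}, reject H0 at 0.05: {p_value < 0.05}")
```

Two caveats apply to this sketch: random-walk samples are correlated and skewed toward high-degree nodes, while the z-test assumes independent draws, so a real pipeline would need to correct for the sampler's bias before trusting the p-value.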
Problem

Research questions and friction points this paper is trying to address.

Formalize node, edge, path hypotheses
Develop sampling-based hypothesis testing framework
Propose Path-Hypothesis-Aware SamplEr (PHASE)
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sampling-based hypothesis testing framework
Path-Hypothesis-Aware SamplEr (PHASE)
Optimized PHASE for time efficiency
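To make the contrast between hypothesis-agnostic and hypothesis-aware sampling concrete, here is a deliberately simplified Python sketch. It is not PHASE and does not implement the m-dimensional random walk described in the abstract. It only assumes that a path hypothesis names a sequence of node types, and it biases a single random walk toward neighbors that extend a match of that sequence. The function `type_guided_walk`, the node types, and the toy graph are hypothetical.

```python
import random


def type_guided_walk(adj, node_type, pattern, start, n_steps, rng):
    """Random walk that prefers neighbors matching the next type in `pattern`.

    Returns the walks completed along the way whose node types spell out
    `pattern`. Nodes may repeat inside a returned walk.
    """
    if len(pattern) < 2:
        raise ValueError("pattern must contain at least two node types")
    paths = []
    partial = [start] if node_type[start] == pattern[0] else []
    current = start
    for _ in range(n_steps):
        neighbors = adj[current]
        if not neighbors:
            break
        wanted = pattern[len(partial)]  # pattern[0] when nothing is matched yet
        matching = [v for v in neighbors if node_type[v] == wanted]
        # Fall back to any neighbor when none has the wanted type.
        current = rng.choice(matching or neighbors)
        if node_type[current] == wanted:
            partial.append(current)
        elif node_type[current] == pattern[0]:
            partial = [current]  # start a fresh match from this node
        else:
            partial = []
        if len(partial) == len(pattern):
            paths.append(tuple(partial))
            partial = [current] if node_type[current] == pattern[0] else []
    return paths


# Toy bipartite graph (made up): users connected to businesses.
adj = {
    "u0": ["b0", "b1"], "u1": ["b0"], "u2": ["b1"],
    "b0": ["u0", "u1"], "b1": ["u0", "u2"],
}
node_type = {v: ("user" if v.startswith("u") else "business") for v in adj}
pattern = ("user", "business", "user")

paths = type_guided_walk(adj, node_type, pattern, "u0", 20, random.Random(3))
print(len(paths), "path instances sampled, e.g.", paths[0])
```

The sampled path instances could then feed a test statistic computed over paths, whereas a type-blind walk on a graph with many node types would spend most of its steps on nodes that the hypothesis never mentions.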
Yun Wang
The University of Hong Kong, Hong Kong SAR, China
Chrysanthi Kosyfaki
The University of Hong Kong, Hong Kong SAR, China
S. Amer-Yahia
CNRS, Univ. Grenoble Alpes, Grenoble, France
Reynold Cheng
ACM Distinguished Member, HKU Computer Science Professor
Data Uncertainty, Graph Databases, Data Science for Social Goods