TLV-HGNN: Thinking Like a Vertex for Memory-efficient HGNN Inference

📅 2025-08-11
🤖 AI Summary
Heterogeneous Graph Neural Network (HGNN) inference suffers from low memory efficiency during neighbor aggregation, caused by semantic-separated execution (which induces redundant intermediate-state storage) and overlapping neighborhoods across semantics (which lead to repeated feature loading and neighbor accesses). Method: We propose a semantics-complete execution paradigm that restructures computation from a vertex-centric perspective to eliminate intermediate storage and fuse shared neighbor accesses. We further design a vertex-grouping strategy to jointly optimize memory-access locality and build TLV-HGNN, a reconfigurable hardware accelerator. Results: Experiments show that TLV-HGNN achieves 7.85× and 1.41× average speedups over an NVIDIA A100 GPU and the state-of-the-art accelerator HiHGNN, respectively, while reducing energy consumption by 98.79% and 32.61%. The approach significantly improves the scalability and energy efficiency of HGNN inference.
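The core idea of the semantics-complete, vertex-centric paradigm can be illustrated with a minimal toy sketch (our reading of the idea, not the authors' implementation; the graph, semantic names, and features below are illustrative assumptions). Each target vertex walks all of its semantics in one pass, so no per-semantic buffer outlives the vertex, and neighbors shared across semantics are fetched once via a small per-vertex cache:

```python
# Toy sketch (an assumption about the paradigm's flavor, not the paper's code).
# Neighbor lists of each target vertex, per semantic (metapath).
neighbors = {
    "sem_A": {0: [1, 2], 1: [2, 3]},
    "sem_B": {0: [2, 3], 1: [3]},   # vertices 2 and 3 overlap with sem_A
}
features = {v: [float(v)] * 4 for v in range(4)}  # 4-dim toy features

loads = 0
fused = {}
for target in [0, 1]:          # one vertex at a time ("think like a vertex")
    cache = {}                 # dedupes cross-semantic neighbor loads
    acc = [0.0] * 4            # single accumulator, no per-semantic copy
    for sem in neighbors:
        for n in neighbors[sem].get(target, []):
            if n not in cache:
                loads += 1     # each shared neighbor is loaded only once
                cache[n] = features[n]
            acc = [a + x for a, x in zip(acc, cache[n])]
    fused[target] = acc        # semantic fusion (here: sum) completes per vertex

print(loads)  # → 5 feature loads; no per-semantic intermediate buffers survive
```

Because aggregation and fusion complete within each vertex's scope, the intermediate state is bounded by one accumulator rather than by the number of semantics.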

📝 Abstract
Heterogeneous graph neural networks (HGNNs) excel at processing heterogeneous graph data and are widely applied in critical domains. In HGNN inference, the neighbor aggregation stage is the primary performance determinant, yet it suffers from two major sources of memory inefficiency. First, the commonly adopted per-semantic execution paradigm stores intermediate aggregation results for each semantic prior to semantic fusion, causing substantial memory expansion. Second, the aggregation process incurs extensive redundant memory accesses, including repeated loading of target vertex features across semantics and repeated accesses to shared neighbors due to cross-semantic neighborhood overlap. These inefficiencies severely limit scalability and reduce HGNN inference performance. In this work, we first propose a semantics-complete execution paradigm from a vertex perspective that eliminates per-semantic intermediate storage and redundant target vertex accesses. Building on this paradigm, we design TLV-HGNN, a reconfigurable hardware accelerator optimized for efficient aggregation. In addition, we introduce a vertex grouping technique based on cross-semantic neighborhood overlap, together with its hardware implementation, to reduce redundant accesses to shared neighbors. Experimental results demonstrate that TLV-HGNN achieves average speedups of 7.85× and 1.41× over the NVIDIA A100 GPU and the state-of-the-art HGNN accelerator HiHGNN, respectively, while reducing energy consumption by 98.79% and 32.61%.
Problem

Research questions and friction points this paper is trying to address.

Memory inefficiency in HGNN neighbor aggregation stage
Redundant memory accesses due to repeated feature loading
Storage expansion from per-semantic intermediate results
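The friction points above can be made concrete with a minimal toy sketch of the conventional per-semantic execution (not the paper's code; the graph, semantic names, and features are illustrative assumptions). Every semantic is aggregated separately, so shared neighbors are reloaded per semantic and every per-semantic result must be materialized until fusion:

```python
# Toy sketch of the per-semantic baseline (illustrative assumptions throughout).
# Neighbor lists of each target vertex, per semantic (metapath).
neighbors = {
    "sem_A": {0: [1, 2], 1: [2, 3]},
    "sem_B": {0: [2, 3], 1: [3]},   # vertices 2 and 3 overlap with sem_A
}
features = {v: [float(v)] * 4 for v in range(4)}  # 4-dim toy features

loads = 0                 # count raw feature loads
intermediates = {}        # per-semantic buffers kept alive until fusion

for sem, nbrs in neighbors.items():
    intermediates[sem] = {}
    for target, nlist in nbrs.items():
        acc = [0.0] * 4
        for n in nlist:
            loads += 1                      # shared neighbors reloaded per semantic
            acc = [a + x for a, x in zip(acc, features[n])]
        intermediates[sem][target] = acc    # stored until all semantics finish

# Semantic fusion (here: plain sum) only happens after ALL semantics are done,
# so every per-semantic result above had to be materialized first.
fused = {
    t: [sum(col) for col in zip(*(intermediates[s][t] for s in neighbors))]
    for t in neighbors["sem_A"]
}
print(loads, len(intermediates))  # → 7 loads, 2 semantic buffers alive at once
```

Even on this four-vertex example, the overlapping neighbors (2 and 3) are fetched once per semantic, and the intermediate storage grows with the number of semantics.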
Innovation

Methods, ideas, or system contributions that make the work stand out.

Semantics-complete execution paradigm eliminates intermediate storage
Reconfigurable hardware accelerator optimizes efficient aggregation
Vertex grouping technique reduces redundant neighbor accesses
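The vertex-grouping idea in the last bullet can be sketched as follows (a greedy toy illustration of the flavor of the technique, not the paper's algorithm; the graph, threshold, and Jaccard criterion are our assumptions). Targets whose cross-semantic neighborhoods overlap heavily are grouped so that one fetch of the shared neighbors' features can serve the whole group:

```python
# Toy sketch (illustrative assumptions, not the paper's grouping algorithm).

def jaccard(a, b):
    """Overlap ratio of two neighbor sets."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b) if a | b else 0.0

# Union of each target vertex's neighbors across all semantics.
nbhd = {0: {1, 2, 3}, 1: {2, 3}, 2: {5, 6}, 3: {5, 6, 7}}

groups = []
for v, ns in nbhd.items():
    for g in groups:
        # Join an existing group if overlap with its representative is high;
        # the 0.5 threshold is an arbitrary illustrative choice.
        if jaccard(ns, nbhd[g[0]]) >= 0.5:
            g.append(v)
            break
    else:
        groups.append([v])

print(groups)  # → [[0, 1], [2, 3]]
```

Processing a group together means its shared neighbors (here {2, 3} for the first group and {5, 6} for the second) are loaded once instead of once per member, which is the locality benefit the grouping targets.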
Dengke Han
Institute of Computing Technology, Chinese Academy of Sciences
graph-based hardware accelerator · high-throughput computer architecture
Duo Wang
State Key Lab of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences
Mingyu Yan
State Key Lab of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences
Xiaochun Ye
State Key Lab of Processors, Institute of Computing Technology, CAS; University of Chinese Academy of Sciences
Dongrui Fan
Institute of Computing Technology, Chinese Academy of Sciences
Computer Architecture · Processor Design · Many-core Design