🤖 AI Summary
This work exposes a critical security vulnerability in graph neural network (GNN) interpretability: subgraph-based explanations inadvertently leak the model's core decision logic, enabling black-box model stealing. To demonstrate this risk, we propose EGSteal, a novel framework that introduces "explanation alignment" to reverse-engineer the target GNN's reasoning process. EGSteal integrates gradient-guided subgraph sampling with adversarial data augmentation to substantially improve query efficiency. Experiments on molecular graph datasets demonstrate that EGSteal achieves high-fidelity replication of both the predictive behavior and the reasoning patterns of the target model with a query budget of at most 1% of the training data, significantly outperforming existing model extraction methods. This study is the first to systematically establish that explanations entail security risk, revealing an inherent trade-off between interpretability and security in GNNs. Our findings provide foundational insights and technical tools for securing explainable AI systems, underscoring the urgent need for safety-aware interpretability frameworks in trustworthy machine learning.
📝 Abstract
Graph Neural Networks (GNNs) have become essential tools for analyzing graph-structured data in domains such as drug discovery and financial analysis, leading to growing demands for model transparency. Recent advances in explainable GNNs have addressed this need by revealing the important subgraphs that influence predictions, but these explanation mechanisms may inadvertently expose models to security risks. This paper investigates how such explanations can leak critical decision logic that is exploitable for model stealing. We propose EGSteal, a novel stealing framework that integrates explanation alignment for capturing decision logic with guided data augmentation for efficient training under limited queries, enabling effective replication of both the predictive behavior and the underlying reasoning patterns of target models. Experiments on molecular graph datasets demonstrate that our approach outperforms conventional model stealing methods under the same query budget. This work highlights important security considerations for the deployment of explainable GNNs in sensitive domains and suggests the need for protective measures against explanation-based attacks. Our code is available at https://github.com/beanmah/EGSteal.
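The explanation-alignment idea described above can be illustrated with a minimal sketch: a surrogate model is trained not only to match the target's predicted class probabilities, but also to reproduce the per-edge importance scores returned by the target's subgraph explainer. The function names (`alignment_loss`, `kl_divergence`), the mean-squared-error term on edge masks, and the weighting coefficient `lam` are illustrative assumptions for exposition, not the paper's actual objective.

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two categorical distributions (prediction matching)."""
    p = np.clip(p, eps, 1.0)
    q = np.clip(q, eps, 1.0)
    return float(np.sum(p * np.log(p / q)))

def alignment_loss(target_probs, surrogate_probs,
                   target_edge_mask, surrogate_edge_mask, lam=0.5):
    """Illustrative combined stealing objective (an assumption, not the
    paper's exact loss): a KL term that matches the target's predictions,
    plus a mean-squared-error term that aligns the surrogate's per-edge
    explanation mask with the target's. `lam` trades off the two terms."""
    pred_term = kl_divergence(target_probs, surrogate_probs)
    expl_term = float(np.mean((target_edge_mask - surrogate_edge_mask) ** 2))
    return pred_term + lam * expl_term
```

Minimizing only the first term recovers a conventional soft-label extraction attack; the second term is what pushes the surrogate toward the target's reasoning pattern rather than just its outputs.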