🤖 AI Summary
This work addresses high startup latency and low query efficiency in graph analytics over lakehouse architectures. The authors propose GraphLake, a lakehouse-native graph compute engine built on TigerGraph that maps lakehouse tables to vertex and edge types in a labeled property graph and supports querying through GSQL. Key techniques include loading only the graph topology to shorten system initialization, a graph-aware caching mechanism, and two lakehouse-optimized parallel primitives for graph computation. Experimental evaluation shows that GraphLake significantly outperforms PuppyGraph, which the authors describe as the current state-of-the-art graph compute engine for lakehouses, in both startup time and query latency across a range of workloads.
📝 Abstract
In this paper, we introduce GraphLake, a purpose-built graph compute engine for Lakehouse. GraphLake is built on top of the commercial graph database TigerGraph. It maps Lakehouse tables to vertex and edge types in a labeled property graph and supports graph analytics over Lakehouse tables using GSQL. To minimize startup time, it loads only the graph topology. Furthermore, it introduces a series of techniques to ensure query efficiency over Lakehouse tables, including a graph-aware caching mechanism and two Lakehouse-optimized parallel primitives. Extensive experiments demonstrate that GraphLake significantly outperforms PuppyGraph, the current state-of-the-art graph compute engine for Lakehouse, achieving both lower startup time and lower query time.
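
The idea of loading only the graph topology at startup and reading properties later can be illustrated with a small toy sketch. This is not GraphLake's implementation: the tables are hypothetical in-memory lists, the class and method names are invented for illustration, and a plain LRU cache stands in for the graph-aware cache, whose design the abstract does not describe.

```python
from collections import OrderedDict

# Hypothetical stand-ins for Lakehouse tables (lists of row dicts).
PERSON_TABLE = [
    {"id": 1, "name": "Ann", "age": 34},
    {"id": 2, "name": "Bob", "age": 41},
    {"id": 3, "name": "Cy", "age": 29},
]
KNOWS_TABLE = [
    {"src": 1, "dst": 2, "since": 2019},
    {"src": 2, "dst": 3, "since": 2021},
]


class TopologyOnlyGraph:
    """Toy graph that reads only key columns at startup (illustrative only)."""

    def __init__(self, vertex_table, edge_table, cache_size=2):
        self._vertex_table = vertex_table
        # Startup: build adjacency from the id/src/dst columns only.
        self.adjacency = {row["id"]: [] for row in vertex_table}
        for row in edge_table:
            self.adjacency[row["src"]].append(row["dst"])
        # Plain LRU cache for properties (an assumption, not the paper's cache).
        self._cache = OrderedDict()
        self._cache_size = cache_size
        self.table_reads = 0  # counts property lookups that miss the cache

    def neighbors(self, vid):
        # Traversal touches only the in-memory topology.
        return self.adjacency[vid]

    def properties(self, vid):
        # Properties are fetched from the table on demand and then cached.
        if vid in self._cache:
            self._cache.move_to_end(vid)
            return self._cache[vid]
        self.table_reads += 1
        row = next(r for r in self._vertex_table if r["id"] == vid)
        props = {k: v for k, v in row.items() if k != "id"}
        self._cache[vid] = props
        if len(self._cache) > self._cache_size:
            self._cache.popitem(last=False)  # evict least recently used
        return props
```

In this sketch, a traversal such as `TopologyOnlyGraph(PERSON_TABLE, KNOWS_TABLE).neighbors(1)` needs no property reads at all; the vertex table is consulted only when `properties` is called, and repeated lookups of the same vertex are served from the cache.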