🤖 AI Summary
To address the challenge of automatic SQL query engine selection in lakehouse-integrated, multi-engine environments, this paper proposes a prior-knowledge-free cross-engine query routing method. The approach centers on a unified multi-task cost model trained on hint-optimized logical query plans, with an architecture that jointly predicts costs across multiple engines and instance configurations and supports zero-shot and few-shot adaptation, eliminating the need for engine-specific modeling. By combining query-plan encoding, multi-task cost prediction, and hint-guided logical optimization, the method improves prediction accuracy and generalization: the average Q-error decreases by up to 12.6%, and total workload runtime is reduced by up to 25.2% in the zero-shot setting and 30.4% in the few-shot setting relative to random routing. The framework mitigates the selection complexity that arises as new engines or workloads are introduced.
📝 Abstract
Lakehouse systems enable the same data to be queried with multiple execution engines. However, selecting the engine best suited to run a SQL query still requires a priori knowledge of the query's computational requirements and each engine's capabilities, a complex and manual task that only becomes more difficult with the emergence of new engines and workloads. In this paper, we address this limitation by proposing a cross-engine optimizer that automates engine selection for diverse SQL queries through a learned cost model. A logical query plan, optimized with hints, serves as the input for cost prediction and routing. Cost prediction is formulated as a multi-task learning problem, and the model architecture uses multiple predictor heads corresponding to different engines and provisionings. This eliminates the need to train engine-specific models and allows new engines to be added flexibly at a minimal fine-tuning cost. Results on various databases and engines show that using a query-optimized logical plan for cost estimation decreases the average Q-error by up to 12.6% compared to using unoptimized plans as input. Moreover, the proposed cross-engine optimizer reduces the total workload runtime by up to 25.2% in a zero-shot setting and 30.4% in a few-shot setting when compared to random routing.
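The multi-head architecture described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: a shared encoder maps a query-plan feature vector to a latent representation, and one small head per engine (or provisioning) predicts that engine's cost from the shared latent, so a new engine adds only a new head. The class and function names, the encoder shape, and the engine labels used in the usage example are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

class MultiEngineCostModel:
    """Sketch of a multi-task cost model: a shared plan encoder
    plus one linear predictor head per engine/provisioning."""

    def __init__(self, n_features, hidden, engines):
        # Shared encoder weights (stands in for a learned plan encoder).
        self.W_enc = rng.normal(0.0, 0.1, (n_features, hidden))
        # One lightweight head per engine.
        self.heads = {e: rng.normal(0.0, 0.1, hidden) for e in engines}

    def add_engine(self, name):
        # A new engine reuses the shared encoder; only its head is new,
        # which is why fine-tuning cost stays minimal.
        self.heads[name] = rng.normal(0.0, 0.1, self.W_enc.shape[1])

    def predict(self, plan_features):
        # Encode the (hint-optimized) plan features once, then score
        # every engine from the same shared representation.
        z = np.tanh(plan_features @ self.W_enc)
        return {e: float(z @ w) for e, w in self.heads.items()}

def route(model, plan_features):
    """Route the query to the engine with the lowest predicted cost."""
    costs = model.predict(plan_features)
    return min(costs, key=costs.get)
```

In use, a router would featurize the optimized logical plan, call `predict` once, and dispatch to the cheapest engine, e.g. `route(model, features)` after `model.add_engine("new_engine")` for a freshly onboarded engine.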