🤖 AI Summary
To address the query blocking, reduced QPS, and high memory overhead caused by global retraining of learned indexes (LIs) under dynamic updates, this paper proposes Sig2Model, a learned index that supports efficient local updates. Its core contributions are: (1) a sigmoid boosting approximation technique that accurately models update-induced shifts in the data distribution; (2) proactive identification of high-update regions via Gaussian mixture models (GMMs), enabling incremental local retraining and strategic placeholder allocation; and (3) a neural joint optimization framework that jointly refines the sigmoid ensemble and the GMM parameters via gradient-based learning. Experiments show that, compared to state-of-the-art updatable learned indexes, Sig2Model reduces retraining cost by up to 20×, improves query throughput by up to 3×, and decreases memory footprint by up to 1000×, significantly improving practicality under dynamic workloads.
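The sigmoid boosting idea can be illustrated with a minimal numpy sketch: a linear base model approximates the CDF of the sorted keys, a burst of inserts shifts all later positions, and one localized sigmoid term absorbs that shift instead of retraining the base model. The sigmoid parameters below (amplitude equal to the number of inserts, center and width matching the hot region) are hand-set for illustration, not the paper's actual fitting procedure.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Base data: 1000 uniform keys; a linear model approximates the CDF (key -> position).
rng = np.random.default_rng(0)
keys = np.sort(rng.uniform(0.0, 100.0, 1000))
slope, intercept = np.polyfit(keys, np.arange(len(keys)), 1)
base = lambda k: slope * k + intercept

# A burst of 300 inserts lands in the narrow region [40, 45], shifting every
# position after it by up to 300.
inserts = np.sort(rng.uniform(40.0, 45.0, 300))
new_keys = np.sort(np.concatenate([keys, inserts]))
true_pos = np.arange(len(new_keys))

# Instead of retraining, add one localized sigmoid correction: amplitude = number
# of inserted keys, centered on the insert region (illustrative parameters).
A, c, s = len(inserts), 42.5, 1.5
corrected = lambda k: base(k) + A * sigmoid((k - c) / s)

err_base = np.max(np.abs(base(new_keys) - true_pos))      # stale model: large error
err_corr = np.max(np.abs(corrected(new_keys) - true_pos)) # sigmoid absorbs the shift
print(err_base, err_corr)
```

The stale linear model is off by roughly the insert count for every key past the hot region, while the single added sigmoid keeps the maximum position error a small fraction of that, which is what lets Sig2Model defer full retraining while preserving bounded error.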
📝 Abstract
Learned Indexes (LIs) represent a paradigm shift from traditional index structures by employing machine learning models to approximate the cumulative distribution function (CDF) of sorted data. While LIs achieve remarkable efficiency for static datasets, their performance degrades under dynamic updates: maintaining the CDF invariant (keeping F a monotone map from each key to its position in sorted order) requires global model retraining, which blocks queries and limits query throughput (queries per second, QPS). Current approaches fail to address these retraining costs effectively, rendering them unsuitable for real-world workloads with frequent updates. In this paper, we present Sig2Model, an efficient and adaptive learned index that minimizes retraining cost through three key techniques: (1) a sigmoid boosting approximation technique that dynamically adjusts the index model by approximating update-induced shifts in data distribution with localized sigmoid functions while preserving bounded error guarantees and deferring full retraining; (2) proactive update training via Gaussian mixture models (GMMs) that identifies high-update-probability regions for strategic placeholder allocation to speed up updates; and (3) a neural joint optimization framework that continuously refines both the sigmoid ensemble and GMM parameters via gradient-based learning. We evaluate Sig2Model against state-of-the-art updatable learned indexes on real-world and synthetic workloads, and show that Sig2Model reduces retraining cost by up to 20×, achieves up to 3× higher QPS, and uses up to 1000× less memory.
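The GMM-based hotspot identification in technique (2) can be sketched with a minimal one-dimensional, two-component GMM fit by expectation-maximization over recently inserted keys: the learned component means mark the high-update regions where placeholder slots would be pre-allocated. The synthetic insert stream, the component count, and the plain EM loop below are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

# Recent insert keys cluster around two hotspots (synthetic example).
rng = np.random.default_rng(1)
x = np.concatenate([rng.normal(20.0, 1.0, 500),   # hotspot near key 20
                    rng.normal(70.0, 2.0, 500)])  # hotspot near key 70

# Initialize the two components at the data quartiles.
mu = np.percentile(x, [25, 75]).astype(float)
var = np.array([np.var(x)] * 2)
w = np.array([0.5, 0.5])

for _ in range(50):  # EM iterations
    # E-step: responsibility of each component for each insert key.
    pdf = w / np.sqrt(2 * np.pi * var) * np.exp(-(x[:, None] - mu) ** 2 / (2 * var))
    r = pdf / pdf.sum(axis=1, keepdims=True)
    # M-step: re-estimate weights, means, and variances from responsibilities.
    n = r.sum(axis=0)
    w = n / len(x)
    mu = (r * x[:, None]).sum(axis=0) / n
    var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / n

# Component means mark high-update regions that would receive extra placeholders.
centers = np.sort(mu)
print(centers)
```

Fitting a density model to the insert stream, rather than reacting to overflows, is what makes the placeholder allocation proactive: regions near a high-weight component mean are padded before they fill up.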