MLKV: Efficiently Scaling up Large Embedding Model Training with Disk-based Key-Value Storage

📅 2025-04-02

🤖 AI Summary

Training large embedding models suffers from data stall and staleness as embedding tables keep growing in size and number, while existing task-specific frameworks each reimplement their own storage components, incurring substantial duplicated engineering effort. This paper introduces MLKV, an efficient, extensible, and reusable disk-based key-value storage framework designed for embedding model training. MLKV provides a reusable embedding storage abstraction across tasks via embedding sharding, asynchronous prefetching, low-overhead versioning, and a lightweight API. Its core contribution is democratizing storage optimizations that were previously exclusive to individual specialized frameworks, so that they can be reused across diverse training tasks. Evaluated on open-source workloads and on eBay's payment transaction risk detection and seller payment risk detection applications, MLKV outperforms offloading strategies built on industrial-strength key-value stores by 1.6×–12.6×. MLKV is open-source at https://github.com/llm-db/MLKV.

📝 Abstract

Many modern machine learning (ML) methods rely on embedding models to learn vector representations (embeddings) for a set of entities (embedding tables). As increasingly diverse ML applications utilize embedding models and embedding tables continue to grow in size and number, there has been a surge in the ad-hoc development of specialized frameworks targeted to train large embedding models for specific tasks. Although the scalability issues that arise in different embedding model training tasks are similar, each of these frameworks independently reinvents and customizes storage components for specific tasks, leading to substantial duplicated engineering efforts in both development and deployment. This paper presents MLKV, an efficient, extensible, and reusable data storage framework designed to address the scalability challenges in embedding model training, specifically data stall and staleness. MLKV augments disk-based key-value storage by democratizing optimizations that were previously exclusive to individual specialized frameworks and provides easy-to-use interfaces for embedding model training tasks. Extensive experiments on open-source workloads, as well as applications in eBay's payment transaction risk detection and seller payment risk detection, show that MLKV outperforms offloading strategies built on top of industrial-strength key-value stores by 1.6-12.6x. MLKV is open-source at https://github.com/llm-db/MLKV.

Problem

Research questions and friction points this paper is trying to address.

Scaling large embedding model training efficiently

Reducing duplicated storage component development efforts

Addressing data stall and staleness issues

Innovation

Methods, ideas, or system contributions that make the work stand out.

Disk-based key-value storage for scalability

Democratizes optimizations from specialized frameworks

Easy-to-use interfaces for embedding tasks
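
To make the idea of a disk-based key-value store for embeddings concrete, here is a minimal Python sketch. It is not MLKV's API: the class `DiskEmbeddingTable`, its methods, and the parameters `dim` and `cache_size` are hypothetical names chosen for illustration, and the standard-library `shelve` module stands in for a real key-value engine. The sketch assumes a write-through policy with a small in-memory LRU cache in front of the on-disk table, which is one common way to reduce disk reads for frequently accessed embeddings.

```python
import os
import shelve
import tempfile
from collections import OrderedDict


class DiskEmbeddingTable:
    """Hypothetical disk-backed embedding table with a small LRU cache.

    Illustration only: this is not MLKV's interface.
    """

    def __init__(self, path, dim, cache_size=2):
        self.db = shelve.open(path)   # on-disk key-value store (stdlib stand-in)
        self.dim = dim                # embedding dimension
        self.cache_size = cache_size  # max number of vectors kept in memory
        self.cache = OrderedDict()    # key -> vector, most recently used last

    def get(self, key):
        k = str(key)
        if k in self.cache:
            # Cache hit: mark as recently used and avoid the disk read.
            self.cache.move_to_end(k)
            return list(self.cache[k])
        # Cache miss: read from disk, or fall back to a zero vector.
        vec = self.db.get(k, [0.0] * self.dim)
        self._admit(k, vec)
        return list(vec)

    def put(self, key, vec):
        k = str(key)
        self.db[k] = list(vec)        # write through to disk
        self._admit(k, list(vec))

    def _admit(self, k, vec):
        # Insert into the cache and evict least recently used entries.
        self.cache[k] = vec
        self.cache.move_to_end(k)
        while len(self.cache) > self.cache_size:
            self.cache.popitem(last=False)

    def close(self):
        self.db.close()


if __name__ == "__main__":
    path = os.path.join(tempfile.mkdtemp(), "embeddings")
    table = DiskEmbeddingTable(path, dim=2, cache_size=2)
    for entity_id in (1, 2, 3):
        table.put(entity_id, [float(entity_id), 0.5])
    # Entity 1 was evicted from the cache, so this lookup goes to disk.
    print(table.get(1))
    table.close()
```

A production system would also need batching, concurrency control, and a policy for how stale a cached embedding may become, none of which this sketch attempts.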
