AutoComp: Automated Data Compaction for Log-Structured Tables in Data Lakes

📅 2025-04-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Log-structured table formats (e.g., Delta Lake, Iceberg, Hudi) in data lakes suffer from excessive small files due to append-only writes and metadata-heavy operations, degrading query performance, increasing storage costs, and limiting system scalability. Existing compaction mechanisms lack flexibility, pursue narrow objectives, and fail to balance benefits against operational overhead. To address this, the paper proposes AutoComp, an extensible, workload-aware, metadata-driven compaction framework featuring dynamic threshold tuning, lightweight online evaluation, and a modular rule engine, deployed in production via the OpenHouse control plane. Evaluated on LinkedIn's production workloads and synthetic benchmarks, AutoComp reduces file counts by up to 92%, improves typical query latency by 3.8×, and maintains bounded runtime overhead.
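The summary above describes a metadata-driven rule engine with tunable thresholds that decides when compaction is worth its cost. As an illustration only (the class, function, and threshold names below are hypothetical and not AutoComp's actual API), a small-file trigger rule over partition metadata might look like:

```python
from dataclasses import dataclass


@dataclass
class PartitionStats:
    """Per-partition file metadata, as an LST catalog might expose it."""
    file_sizes: list[int]  # data file sizes in bytes


def needs_compaction(stats: PartitionStats,
                     small_file_bytes: int = 128 * 1024 * 1024,
                     min_small_files: int = 10,
                     min_small_ratio: float = 0.5) -> bool:
    """Trigger compaction only when a partition has many small files
    AND they make up a large share of its data files, so the benefit
    is likely to outweigh the rewrite cost."""
    small = [s for s in stats.file_sizes if s < small_file_bytes]
    if len(small) < min_small_files:
        return False
    return len(small) / len(stats.file_sizes) >= min_small_ratio
```

In a modular rule engine, rules like this one would be pluggable, and the thresholds would be tuned dynamically per table rather than fixed as defaults.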

πŸ“ Abstract
The proliferation of small files in data lakes poses significant challenges, including degraded query performance, increased storage costs, and scalability bottlenecks in distributed storage systems. Log-structured table formats (LSTs) such as Delta Lake, Apache Iceberg, and Apache Hudi exacerbate this issue due to their append-only write patterns and metadata-intensive operations. While compaction (the process of consolidating small files into fewer, larger files) is a common solution, existing automation mechanisms often lack the flexibility and scalability to adapt to diverse workloads and system requirements while balancing the trade-offs between compaction benefits and costs. In this paper, we present AutoComp, a scalable framework for automatic data compaction tailored to the needs of modern data lakes. Drawing on deployment experience at LinkedIn, we analyze the operational impact of small file proliferation, establish key requirements for effective automatic compaction, and demonstrate how AutoComp addresses these challenges. Our evaluation, conducted using synthetic benchmarks and production environments via integration with OpenHouse, a control plane for catalog management, schema governance, and data services, shows significant improvements in file count reduction and query performance. We believe AutoComp's built-in extensibility provides a robust foundation for evolving compaction systems, facilitating future integration of refined multi-objective optimization approaches, workload-aware compaction strategies, and expanded support for broader data layout optimizations.
Problem

Research questions and friction points this paper is trying to address.

Reduces small file proliferation in data lakes
Improves query performance and storage efficiency
Automates compaction for log-structured tables
Innovation

Methods, ideas, or system contributions that make the work stand out.

Scalable framework for automatic data compaction
Integration with OpenHouse for performance evaluation
Extensible foundation for future optimization approaches
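The consolidation step itself is, at its core, a bin-packing problem: group small files so that each rewritten output file approaches a target size. A minimal greedy sketch (the target size and first-fit strategy are illustrative assumptions, not the paper's actual algorithm):

```python
def plan_compaction(file_sizes: list[int],
                    target_bytes: int = 512 * 1024 * 1024) -> list[list[int]]:
    """Greedily group input files so each group's total size stays
    within the target output file size. Returns a list of groups,
    where each group would be rewritten as one larger file."""
    groups: list[list[int]] = []
    current: list[int] = []
    current_bytes = 0
    # Largest-first placement tends to produce fuller groups.
    for size in sorted(file_sizes, reverse=True):
        if current and current_bytes + size > target_bytes:
            groups.append(current)
            current, current_bytes = [], 0
        current.append(size)
        current_bytes += size
    if current:
        groups.append(current)
    return groups
```

A real compaction service would additionally weigh rewrite cost, data freshness, and concurrent writers before executing such a plan.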
Authors

Anja Gruenheid
Microsoft, Zurich, Switzerland
Jesús Camacho-Rodríguez
Microsoft, Mountain View, CA, USA
Carlo Curino
Microsoft, Redmond, WA, USA
Raghu Ramakrishnan
Microsoft
Stanislav Pak
LinkedIn, Sunnyvale, CA, USA
Sumedh Sakdeo
LinkedIn, Sunnyvale, CA, USA
Lenisha Gandhi
LinkedIn, Sunnyvale, CA, USA
Sandeep K. Singhal
LinkedIn, Sunnyvale, CA, USA
Pooja Nilangekar
University of Maryland, College Park, MD, USA
Daniel J. Abadi
University of Maryland, College Park, MD, USA