🤖 AI Summary
This work addresses key challenges in multimodal data analytics observed in ByteDance's production environments: I/O-inefficient multimodal storage, inflexible query optimization, and performance degradation caused by resource disaggregation. To overcome these limitations, the authors propose ByteHouse, a cloud-native data warehouse for real-time multimodal analytics. Its storage layer combines a unified table engine (a two-tier logical abstraction with a physically consistent layout), a cluster-wide shared SSD cache (CrossCache), and a virtual file system (NexusFS) that enables efficient local data access on compute nodes. The compute layer supports analytical, batch, and incremental execution modes, and the control layer includes a query optimizer that uses historical execution traces and AI-assisted plan selection. Evaluations on internal and standard workloads show significant efficiency improvements over existing systems.
📝 Abstract
With the rapid rise of intelligent data services, modern enterprises increasingly require efficient, multimodal, and cost-effective data analytics infrastructures. However, in ByteDance's production environments, existing systems fall short due to limitations such as I/O-inefficient multimodal storage, inflexible query optimization (e.g., failing to optimize multimodal access patterns), and performance degradation caused by resource disaggregation (e.g., loss of data locality in remote storage). To address these challenges, we introduce ByteHouse (https://bytehouse.cloud), a cloud-native data warehouse designed for real-time multimodal data analytics. The storage layer integrates a unified table engine that provides a two-tier logical abstraction and a physically consistent layout, an SSD-backed cluster-scale cache (CrossCache) that supports shared caching across compute nodes, and a virtual file system (NexusFS) that enables efficient local access on compute nodes. The compute layer supports analytical, batch, and incremental execution modes, with tailored optimizations for hybrid queries (e.g., runtime filtering over tiered vector indexes). The control layer coordinates global metadata and transactions, and features an effective optimizer enhanced by historical execution traces and AI-assisted plan selection. Evaluations on internal and standard workloads show that ByteHouse achieves significant efficiency improvements over existing systems.