🤖 AI Summary
This paper addresses three challenges that multi-tenant, serverless NoSQL databases face in large-scale cloud environments with dynamic, heterogeneous workloads: (1) caching undermines performance isolation, because cache hits change how much resource a request actually consumes; (2) shifts in tenant traffic and access distribution cause throttling, resource waste, hot keys, or falling cache hit ratios; and (3) tenants' diverse resource requirements leave data nodes unevenly loaded. To tackle these, the authors propose ABase, which combines: (1) a two-layer caching architecture with cache-aware performance isolation; (2) a predictive autoscaling policy that adjusts resources as tenant traffic changes; and (3) a multi-resource rescheduling algorithm that balances utilization across data nodes. Deployed in ByteDance's production cloud environment, the system supports a peak throughput of over 13 billion QPS and total storage exceeding 1 EB. Evaluation results demonstrate strong performance isolation, a 57% reduction in service jitter, and over 35% improvement in aggregate resource utilization.
📝 Abstract
Multi-tenant architectures enhance the elasticity and resource utilization of NoSQL databases by allowing multiple tenants to co-locate and share resources. However, in large-scale cloud environments, the diverse and dynamic nature of workloads poses significant challenges for multi-tenant NoSQL databases. Based on our practical observations, we have identified three crucial challenges: (1) the impact of caching on performance isolation, as cache hits alter request execution and resource consumption, leading to inaccurate traffic control; (2) the dynamic changes in traffic, where shifts in tenant traffic trends cause throttling or resource wastage, and shifts in access distribution cause hot-key pressure or drops in cache hit ratio; and (3) the imbalanced layout of data nodes due to tenants' diverse resource requirements, leading to low resource utilization. To address these challenges, we introduce ABase, a multi-tenant serverless NoSQL database developed at ByteDance. ABase combines a two-layer caching mechanism with a cache-aware isolation mechanism to ensure accurate resource consumption estimates. Furthermore, ABase employs a predictive autoscaling policy to dynamically adjust resources in response to tenant traffic changes and a multi-resource rescheduling algorithm to balance resource utilization across data nodes. With these innovations, ABase has successfully served ByteDance's large-scale cloud environment, supporting a total workload with a peak of over 13 billion QPS and total storage exceeding 1 EB.
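To make the first challenge concrete, the sketch below shows one way a traffic controller could account for cache hits when charging a tenant's quota. It is a minimal illustration and not ABase's actual mechanism: the `TenantQuota` class, the `charge` function, and the `HIT_COST` and `MISS_COST` values are all hypothetical, and the sketch assumes the hit-or-miss outcome is known when the request is charged.

```python
from dataclasses import dataclass

# Assumed, illustrative costs in abstract "request units"; not taken from the paper.
HIT_COST = 1.0   # request served from cache
MISS_COST = 5.0  # request that falls through to storage


@dataclass
class TenantQuota:
    """Hypothetical per-tenant token bucket."""
    capacity: float            # maximum units the bucket can hold
    refill_rate: float         # units added per second
    tokens: float = 0.0        # units currently available
    last_refill: float = 0.0   # timestamp of the last refill, in seconds

    def refill(self, now: float) -> None:
        """Add tokens for the elapsed time, capped at capacity."""
        elapsed = max(0.0, now - self.last_refill)
        self.tokens = min(self.capacity, self.tokens + elapsed * self.refill_rate)
        self.last_refill = now


def charge(quota: TenantQuota, cache_hit: bool, now: float) -> bool:
    """Charge a request by its cache-dependent cost; return False if throttled."""
    quota.refill(now)
    cost = HIT_COST if cache_hit else MISS_COST
    if quota.tokens >= cost:
        quota.tokens -= cost
        return True
    return False


if __name__ == "__main__":
    quota = TenantQuota(capacity=10.0, refill_rate=0.0, tokens=10.0)
    admitted = sum(charge(quota, cache_hit=False, now=0.0) for _ in range(5))
    print(f"cache misses admitted from a 10-unit budget: {admitted}")
```

With a flat per-request cost, a tenant whose requests mostly miss the cache would consume several times more resources than its quota suggests; charging by outcome keeps the quota aligned with actual consumption.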