Low-Latency ML Inference by Grouping Correlated Data Objects and Computation

📅 2023-11-30
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
AI-driven streaming inference workflows face challenges due to dynamically shifting data access patterns under event-driven execution, rendering conventional caching and scheduling techniques ineffective for latency reduction. To address this, we propose an application-aware data-computation correlation modeling mechanism, introducing the novel abstraction of “correlation grouping.” This enables developers to explicitly declare dynamic data dependencies via lightweight runtime annotations—without modifying core logic. Leveraging this abstraction, we design correlation-driven cross-node cooperative scheduling, streaming workflow orchestration, and cluster-wide data-affinity scheduling policies. Evaluated on real-world latency-sensitive ML applications, our approach achieves a 58% reduction in end-to-end latency, a 73% decrease in latency variance, and a 41% improvement in node utilization—all with minimal code changes.
📝 Abstract
ML inference workflows often require low latency and high throughput, yet we lack good options for addressing this need. Techniques that reduce latency in other streaming settings (such as caching and optimization-driven scheduling) are of limited value because ML data dependencies are often very large and can change dramatically depending on the triggering event. In this work, we propose a novel correlation grouping mechanism that makes it easier for developers to express application-specific data access correlations, enabling coordinated management of data objects in server clusters hosting streaming inference tasks. Experiments based on a latency-sensitive ML-based application confirm the limitations of standard techniques while showing that our solution yields dramatically better performance. The proposed mechanism is able to maintain significantly lower and more consistent latency, achieves higher node utilization as workload and scale-out increase, and yet requires only minor changes to the code implementing the application.
Problem

Research questions and friction points this paper is trying to address.

Reducing latency in AI inference workflows with dynamic data access patterns
Enabling coordinated data management in server clusters for streaming tasks
Improving performance without major code changes via correlation grouping
Innovation

Methods, ideas, or system contributions that make the work stand out.

Correlation grouping for AI data access
Coordinated management in server clusters
Low-latency scalable AI inference