SAIR: Cost-Efficient Multi-Stage ML Pipeline Autoscaling via In-Context Reinforcement Learning

📅 2026-01-29
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of efficiently autoscaling multi-stage machine learning inference pipelines, which is hindered by resource heterogeneity, stage coupling, and dynamically shifting bottlenecks. The authors propose SAIR, a framework that uses a large language model as an in-context reinforcement learning controller, optimizing scaling policies online from reward-labeled interaction histories without gradient updates or offline training. Key innovations include Pareto-dominance-based reward shaping, provably separable decision boundaries, surprisal-guided experience retrieval, and fine-grained GPU rate control via user-space CUDA interception. Evaluated across four ML pipelines and three workload types, SAIR reduces P99 latency by up to 50%, cuts resource costs by as much as 97%, achieves 86% accuracy in bottleneck detection, and matches or outperforms existing baselines.

📝 Abstract
Multi-stage ML inference pipelines are difficult to autoscale due to heterogeneous resources, cross-stage coupling, and dynamic bottleneck migration. We present SAIR, an autoscaling framework that uses an LLM as an in-context reinforcement learning controller, improving its policy online from reward-labeled interaction histories without gradient updates. SAIR combines Pareto-dominance reward shaping with a provable separation margin, surprisal-guided experience retrieval for context efficiency, and fine-grained GPU rate control via user-space CUDA interception. We provide regret analysis decomposing error into retrieval coverage and LLM selection components. On four ML serving pipelines under three workload patterns, SAIR achieves the best or tied-best P99 latency and effective resource cost among deployed baselines, improving P99 by up to 50% and reducing effective cost by up to 97% (under GPU rate-control assumptions), with 86% bottleneck detection accuracy and no offline training.
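The Pareto-dominance reward shaping named in the abstract can be sketched in a few lines. This is a hypothetical illustration, not the paper's implementation: the two objectives (P99 latency, effective cost), the ±1/0 reward labels, and the comparison against the full history are all assumptions; the paper additionally proves a separation margin that this sketch does not capture.

```python
def dominates(a, b):
    """True if outcome a Pareto-dominates b: no worse on every
    objective and strictly better on at least one (lower is better)."""
    return all(x <= y for x, y in zip(a, b)) and any(x < y for x, y in zip(a, b))

def shaped_reward(candidate, history):
    """Label a scaling decision's (latency, cost) outcome against past
    outcomes: -1 if any past outcome dominates it, +1 if it dominates
    at least one past outcome, 0 if it is incomparable to all of them."""
    if any(dominates(past, candidate) for past in history):
        return -1
    if any(dominates(candidate, past) for past in history):
        return 1
    return 0
```

For example, an outcome of (120 ms, $0.8) against a history of [(150 ms, $1.0), (110 ms, $1.5)] earns +1: it dominates the first point and is dominated by neither. Such scalar labels are what an in-context RL controller could attach to each logged interaction.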
Problem

Research questions and friction points this paper is trying to address.

multi-stage ML pipeline
autoscaling
dynamic bottleneck migration
heterogeneous resources
cross-stage coupling
Innovation

Methods, ideas, or system contributions that make the work stand out.

in-context reinforcement learning
LLM-based autoscaling
Pareto-dominance reward shaping
surprisal-guided retrieval
fine-grained GPU control
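The surprisal-guided retrieval listed above can be sketched as a top-k selection over past transitions, ranked by how far each observed reward deviated from a predicted reward. This is a minimal sketch under assumptions: the `(state, action, reward)` tuple shape, the `predict_reward` scoring function, and absolute deviation as the surprisal measure are all hypothetical stand-ins for the paper's actual mechanism.

```python
def surprisal_retrieve(history, predict_reward, k):
    """Return the k past transitions whose observed reward deviated most
    from the controller's prediction; the most 'surprising' experiences
    are assumed to be the most informative to keep in a limited
    LLM context window."""
    def surprisal(transition):
        state, action, reward = transition
        return abs(reward - predict_reward(state, action))
    return sorted(history, key=surprisal, reverse=True)[:k]
```

The point of ranking by surprisal rather than recency or similarity is context efficiency: a fixed token budget is spent on the interactions that most contradicted the controller's current policy, rather than on redundant, well-predicted ones.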