MACE: A Hybrid LLM Serving System with Colocated SLO-aware Continuous Retraining Alignment

📅 2025-09-28
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Deploying large language models (LLMs) on edge servers for latency-sensitive applications, e.g., personalized assistants, is challenged by non-stationary user data, which necessitates frequent model retraining and forces a trade-off between inference latency and accuracy. Method: We propose the first hybrid serving system supporting *iteration-level scheduling*, co-locating SLO-aware fine-grained retraining with inference. It enables concurrent execution of prefill/decode and fine-tuning, employs intelligent memory management, and dynamically prioritizes GPU cycles across tasks. Contribution/Results: Evaluated on an NVIDIA AGX Orin platform, our system reduces end-to-end inference latency by up to 63% while maintaining throughput and sustaining GPU utilization above 85%. It significantly outperforms periodic retraining baselines, demonstrating effective real-time resource coordination for adaptive LLM serving at the edge.

📝 Abstract
Large language models (LLMs) deployed on edge servers are increasingly used in latency-sensitive applications such as personalized assistants, recommendation, and content moderation. However, the non-stationary nature of user data necessitates frequent retraining, which introduces a fundamental tension between inference latency and model accuracy under constrained GPU resources. Existing retraining strategies either delay model updates, over-commit resources to retraining, or overlook iteration-level retraining granularity. In this paper, we identify that iteration-level scheduling is crucial for adapting retraining frequency to model drift without violating service-level objectives (SLOs). We propose MACE, a hybrid LLM system that colocates concurrent inference (prefill, decode) and fine-tuning, with intelligent memory management to maximize task performance while preserving inference throughput. MACE leverages the insight that not all model updates equally affect output alignment and allocates GPU cycles accordingly to balance throughput, latency, and update freshness. Our trace-driven evaluation shows that MACE matches or exceeds continuous retraining while reducing inference latency by up to 63% and maintaining throughput under resource constraints. Compared to periodic retraining, MACE improves the latency breakdown across prefill, decode, and fine-tune stages, and sustains GPU utilization above 85% on an NVIDIA AGX Orin. These results demonstrate that iteration-level hybrid scheduling is a promising direction for deploying LLMs with continual learning capabilities on edge platforms.
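The iteration-level colocation idea from the abstract can be sketched as a scheduler that steals a fine-tuning iteration only when every queued inference request still has enough SLO slack to absorb it. This is a minimal illustrative sketch, not the paper's actual algorithm; the `Request` fields, the `schedule_step` function, and the slack heuristic are all assumptions made for exposition.

```python
import heapq
from dataclasses import dataclass, field

@dataclass(order=True)
class Request:
    deadline: float                      # SLO deadline in seconds from epoch
    cost: float = field(compare=False)   # estimated GPU time for this request

def schedule_step(inference_queue, finetune_iter_cost, now):
    """Pick the next GPU task at iteration granularity (hypothetical heuristic).

    Runs one fine-tuning iteration only if the most urgent inference
    request retains positive slack after paying for it; otherwise the
    GPU serves inference first.
    """
    if not inference_queue:
        return "finetune"
    most_urgent = inference_queue[0]     # heap ordered by deadline
    slack = most_urgent.deadline - now - most_urgent.cost
    if slack > finetune_iter_cost:
        return "finetune"                # SLO can absorb one retraining step
    return "inference"

# Example: one request due in 1.0 s, needing 0.2 s of GPU time.
queue = [Request(deadline=1.0, cost=0.2)]
heapq.heapify(queue)
print(schedule_step(queue, finetune_iter_cost=0.5, now=0.0))  # finetune
print(schedule_step(queue, finetune_iter_cost=0.9, now=0.0))  # inference
```

A real system would also account for memory pressure and batch composition, but the core decision, inference versus one more fine-tuning iteration, is made at this per-iteration granularity.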
Problem

Research questions and friction points this paper is trying to address.

Balancing inference latency and model accuracy during frequent retraining
Managing GPU resources for concurrent inference and fine-tuning tasks
Maintaining service-level objectives while adapting to non-stationary user data
Innovation

Methods, ideas, or system contributions that make the work stand out.

Colocated inference and fine-tuning with intelligent memory management
Iteration-level scheduling for SLO-aware retraining frequency adaptation
Selective GPU cycle allocation based on each update's impact on output alignment
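The third innovation, allocating GPU cycles by update impact, could be realized by mapping an estimated model-drift score to a per-window fine-tuning budget. The function below is a hypothetical heuristic illustrating that insight (the paper does not specify this formula; `drift_score` and the linear mapping are assumptions):

```python
def retrain_budget(drift_score, min_iters=1, max_iters=8):
    """Map an estimated drift score in [0, 1] to a fine-tuning iteration
    budget for the next scheduling window (illustrative heuristic only).
    Low-drift windows get few retraining cycles, freeing the GPU for
    inference; high-drift windows get more to restore alignment."""
    drift_score = min(max(drift_score, 0.0), 1.0)  # clamp to [0, 1]
    return min_iters + round(drift_score * (max_iters - min_iters))

print(retrain_budget(0.0))  # 1
print(retrain_budget(1.0))  # 8
```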