🤖 AI Summary
Deploying foundation models at the edge is difficult because tight resource constraints and changing network conditions make it hard to ensure real-time performance, privacy, and quality of service (QoS) at the same time. This work proposes a microservice-based inference framework for foundation models that exploits the functional asymmetry between heavyweight core services and lightweight services through a two-tier deployment strategy: static placement for core services and dynamic orchestration for light services. Core services are placed by a network-aware integer program with sparsity constraints, while light services are managed by a Lyapunov-based online controller grounded in effective capacity theory, which together provide fault tolerance and probabilistic latency guarantees under fluctuating load. Simulations show an average on-time task completion rate above 84% at moderate deployment cost, with strong scalability and robustness as load grows.
📝 Abstract
Foundation models (FMs) unlock unprecedented multimodal and multitask intelligence, yet their cloud-centric deployment precludes real-time responsiveness and compromises user privacy. Meanwhile, monolithic execution at the edge remains infeasible under stringent resource limits and uncertain network dynamics. To bridge this gap, we propose a microservice-based FM inference framework that exploits the intrinsic functional asymmetry between heavyweight core services and agile light services. Our two-tier deployment strategy ensures robust Quality of Service (QoS) under resource contention. Specifically, core services are placed statically via a long-term network-aware integer program with sparsity constraints to form a fault-tolerant backbone. In contrast, light services are orchestrated dynamically by a low-complexity online controller that integrates effective capacity theory with Lyapunov optimization, providing probabilistic latency guarantees under real-time workload fluctuations. Simulations demonstrate that our framework achieves over 84% average on-time task completion at moderate deployment cost and remains robust as the system load scales.
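To make the idea of a Lyapunov-based online controller more concrete, the following is a minimal sketch of the generic textbook drift-plus-penalty rule applied to replica scaling for a single light service. It is not the authors' controller: the abstract gives no formulation, so the queue model, the linear cost, and every name and parameter here (`update_queue`, `choose_replicas`, `simulate`, `v`, `cost_per_replica`, `rate_per_replica`, `max_replicas`) are illustrative assumptions, and the effective-capacity latency constraint is omitted entirely.

```python
# Hypothetical sketch: generic drift-plus-penalty scaling for one light service.
# All names and parameter values are assumptions, not taken from the paper.

def update_queue(backlog, arrivals, served):
    """Backlog recursion Q(t+1) = max(Q(t) - served, 0) + arrivals."""
    return max(backlog - served, 0.0) + arrivals


def choose_replicas(backlog, v, cost_per_replica, rate_per_replica, max_replicas):
    """Pick n in {0, ..., max_replicas} minimizing V*cost*n - Q(t)*rate*n.

    The first term is the deployment-cost penalty weighted by V; the second
    is the drift term, which rewards service capacity when backlog is large.
    Ties are broken in favor of the smallest n.
    """
    def objective(n):
        return v * cost_per_replica * n - backlog * rate_per_replica * n

    return min(range(max_replicas + 1), key=objective)


def simulate(arrivals, v=10.0, cost_per_replica=1.0,
             rate_per_replica=2.0, max_replicas=4):
    """Run the controller over a sequence of per-slot arrivals.

    Returns a list of (replicas_chosen, backlog_after_slot) tuples.
    """
    backlog = 0.0
    history = []
    for a in arrivals:
        n = choose_replicas(backlog, v, cost_per_replica,
                            rate_per_replica, max_replicas)
        backlog = update_queue(backlog, a, n * rate_per_replica)
        history.append((n, backlog))
    return history


if __name__ == "__main__":
    for slot, (n, q) in enumerate(simulate([3, 3, 3, 3, 3, 3])):
        print(f"slot {slot}: replicas={n}, backlog={q}")
```

Because the per-slot objective is linear in the replica count, this rule is bang-bang: it deploys nothing while the backlog is below `v * cost_per_replica / rate_per_replica` and scales to `max_replicas` above it. A larger `v` trades higher backlog (and hence latency) for lower deployment cost, which is the standard cost-delay trade-off of drift-plus-penalty methods.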