Modular Foundation Model Inference at the Edge: Network-Aware Microservice Optimization

📅 2026-01-27

🤖 AI Summary

Deploying foundation models at the edge is constrained by limited resources and uncertain network dynamics, which makes it hard to deliver real-time performance, privacy, and quality of service (QoS) at the same time. This work proposes a microservice-based inference framework for foundation models that exploits the functional asymmetry between heavyweight core services and lightweight services through a two-tier deployment strategy. Core services are placed statically by a long-term, network-aware integer program with sparsity constraints, forming a fault-tolerant backbone. Light services are orchestrated dynamically by a low-complexity online controller that combines effective capacity theory with Lyapunov optimization to provide probabilistic latency guarantees under workload fluctuations. In simulations, the framework achieves an average on-time task completion rate above 84% at moderate deployment cost and remains robust as system load scales.

📝 Abstract

Foundation models (FMs) unlock unprecedented multimodal and multitask intelligence, yet their cloud-centric deployment precludes real-time responsiveness and compromises user privacy. Meanwhile, monolithic execution at the edge remains infeasible under stringent resource limits and uncertain network dynamics. To bridge this gap, we propose a microservice-based FM inference framework that exploits the intrinsic functional asymmetry between heavyweight core services and agile light services. Our two-tier deployment strategy ensures robust Quality of Service (QoS) under resource contention. Specifically, core services are placed statically via a long-term network-aware integer program with sparsity constraints to form a fault-tolerant backbone. On the other hand, light services are orchestrated dynamically by a low-complexity online controller that integrates effective capacity theory with Lyapunov optimization, providing probabilistic latency guarantees under real-time workload fluctuations. Simulations demonstrate that our framework achieves over 84% average on-time task completion with moderate deployment costs and maintains strong robustness as the system load scales.
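
The dynamic tier described above can be illustrated with a minimal sketch. This is not the paper's controller: it is a generic drift-plus-penalty rule from Lyapunov optimization, paired with the textbook effective capacity expression for i.i.d. per-slot service, E_C(θ) = -(1/θ) ln E[exp(-θ s)]. The names `Action`, `choose_action`, `simulate`, and `effective_capacity`, the trade-off weight `v`, and the cost and service numbers are all hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """Hypothetical per-slot choice for a light service."""
    replicas: int    # number of light-service replicas to run
    cost: float      # deployment cost paid in this slot
    service: float   # tasks the replicas can complete in this slot


def effective_capacity(service_samples, theta):
    """Effective capacity for i.i.d. per-slot service (assumption):
    E_C(theta) = -(1/theta) * ln E[exp(-theta * s)].
    A larger theta expresses a stricter latency requirement."""
    mean_exp = sum(math.exp(-theta * s) for s in service_samples) / len(service_samples)
    return -math.log(mean_exp) / theta


def choose_action(queue, actions, v):
    """Drift-plus-penalty rule: minimize v * cost - queue * service.
    A long backlog favors more service; a large v favors lower cost."""
    return min(actions, key=lambda a: v * a.cost - queue * a.service)


def simulate(arrivals, actions, v, queue=0.0):
    """Run the controller over a sequence of per-slot task arrivals."""
    history = []
    for arrived in arrivals:
        action = choose_action(queue, actions, v)
        # Standard queue update: serve the backlog first, then admit new arrivals.
        queue = max(queue - action.service, 0.0) + arrived
        history.append((action.replicas, queue))
    return history


# Illustrative numbers only (not taken from the paper).
ACTIONS = [Action(0, 0.0, 0.0), Action(1, 1.0, 2.0), Action(2, 2.0, 4.0)]
history = simulate([3.0, 3.0, 3.0], ACTIONS, v=1.0)
```

With these illustrative numbers, the controller idles while the queue is empty and scales up once a backlog appears. In this generic rule, raising `v` trades a longer backlog for lower deployment cost.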

Problem

Research questions and friction points this paper is trying to address.

Foundation Models

Edge Inference

Microservice Optimization

Network-Aware Deployment

Resource Constraints

Innovation

Methods, ideas, or system contributions that make the work stand out.

Modular Foundation Models

Edge Inference

Microservice Optimization