🤖 AI Summary
To address high latency, excessive cost, weak privacy guarantees, and capability-resource mismatch in deploying generative AI across cloud-edge environments, this paper proposes a framework for edge-cloud collaborative inference. It defines four collaboration paradigms, designs hierarchical scheduling principles, and introduces a lightweight communication protocol. The method integrates model partitioning, dynamic offloading, quantization-aware gradient exchange, and cache-enhanced prompt routing. Evaluated on a hybrid testbed of Jetson AGX edge devices and a cloud cluster, the framework reduces end-to-end latency by 63%, cuts communication overhead by 71%, and maintains inference quality above 92% of large language model (LLM) baselines. This work delivers a scalable, system-level solution for efficient, secure, and high-fidelity deployment of generative AI in resource-constrained edge environments.
📝 Abstract
The rapid adoption of generative AI (GenAI), particularly Large Language Models (LLMs), has exposed critical limitations of cloud-centric deployments, including high latency, high cost, and privacy risks. Meanwhile, Small Language Models (SLMs) are emerging as viable alternatives for resource-constrained edge environments, though they often lack the capabilities of their larger counterparts. This article explores collaborative inference systems that leverage both edge and cloud resources to address these challenges. By presenting distinct cooperation strategies alongside practical design principles and experimental insights, we offer actionable guidance for deploying GenAI across the computing continuum.
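One cooperation strategy in the spirit of the edge-cloud collaboration described above is confidence-gated cascading: the edge SLM answers first, and only low-confidence queries are escalated to the cloud LLM. The sketch below is purely illustrative and not the paper's actual protocol; `edge_generate`, `cloud_generate`, the toy confidence heuristic, and the 0.7 threshold are all assumptions.

```python
# Minimal sketch of confidence-gated edge-cloud cascade routing (illustrative,
# not the framework from the paper). `edge_generate` and `cloud_generate` are
# hypothetical stand-ins for an on-device SLM and a remote LLM service.

from dataclasses import dataclass


@dataclass
class Result:
    text: str
    confidence: float  # e.g. mean token probability reported by the SLM
    served_by: str     # "edge" or "cloud"


def edge_generate(prompt: str) -> Result:
    # Placeholder: a real system would run a quantized SLM on the device
    # and derive confidence from its token-level probabilities.
    conf = 0.9 if len(prompt) < 40 else 0.4  # toy heuristic for the sketch
    return Result(text=f"[edge answer to: {prompt}]",
                  confidence=conf, served_by="edge")


def cloud_generate(prompt: str) -> Result:
    # Placeholder: a real system would call a cloud-hosted LLM endpoint.
    return Result(text=f"[cloud answer to: {prompt}]",
                  confidence=1.0, served_by="cloud")


def route(prompt: str, threshold: float = 0.7) -> Result:
    """Serve locally when the SLM is confident; otherwise escalate to cloud."""
    local = edge_generate(prompt)
    if local.confidence >= threshold:
        return local
    return cloud_generate(prompt)
```

In this pattern, the threshold trades quality for latency and cost: raising it sends more traffic to the cloud, lowering it keeps more queries on the edge.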