The Larger the Merrier? Efficient Large AI Model Inference in Wireless Edge Networks

📅 2025-05-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
Low-latency, privacy-preserving inference of large AI models (LAIMs) in wireless edge networks is challenging under stringent device resource constraints. Method: The paper proposes a pruning-aware device-edge collaborative inference framework in which a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for joint execution. The analysis first proves that the LAIM output distortion is upper bounded by its parameter distortion, and then uses rate-distortion theory to derive a lower bound on parameter distortion as a function of the pruning ratio, analytically linking pruning to co-inference performance. On this basis, the authors formulate a distortion-bound minimization problem that jointly optimizes the pruning ratio, transmit power, and CPU frequency under latency, energy, and resource constraints, and propose an efficient algorithm for the highly non-convex problem. Results: Simulations confirm that parameter distortion provides a reliable bound on output distortion. The framework achieves superior trade-offs among inference performance, latency, and energy consumption, outperforming fully on-device and fully on-server baselines, and shows that the model split point plays a critical role in heterogeneous, resource-constrained edge environments.
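In rough terms, the analysis chains two bounds: output distortion is controlled by parameter distortion, and parameter distortion is lower-bounded via rate-distortion theory as the pruning ratio grows. The sketch below uses the classical Gaussian rate-distortion function as a stand-in; the symbols (D_out, D_param, rho, R(rho)) are illustrative assumptions, not the paper's notation.

```latex
% Illustrative sketch only; symbols are assumed, not taken from the paper.
% D_out   : output distortion of the pruned, partitioned LAIM
% D_param : distortion of the pruned parameters w.r.t. the pre-trained weights
% rho     : pruning ratio; R(rho) : effective description rate, decreasing in rho
\begin{align}
  D_{\mathrm{out}}   &\le c \, D_{\mathrm{param}} \\          % output distortion bounded by parameter distortion
  D_{\mathrm{param}} &\ge \sigma^{2}\, 2^{-2 R(\rho)}          % Gaussian rate-distortion lower bound, increasing in rho
\end{align}
```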

📝 Abstract
The growing demand for large artificial intelligence model (LAIM) services is driving a paradigm shift from traditional cloud-based inference to edge-based inference for low-latency, privacy-preserving applications. In particular, edge-device co-inference, which partitions LAIMs between edge devices and servers, has emerged as a promising strategy for resource-efficient LAIM execution in wireless networks. In this paper, we investigate a pruning-aware LAIM co-inference scheme, where a pre-trained LAIM is pruned and partitioned into on-device and on-server sub-models for deployment. For analysis, we first prove that the LAIM output distortion is upper bounded by its parameter distortion. Then, we derive a lower bound on parameter distortion via rate-distortion theory, analytically capturing the relationship between pruning ratio and co-inference performance. Next, based on the analytical results, we formulate an LAIM co-inference distortion bound minimization problem by jointly optimizing the pruning ratio, transmit power, and computation frequency under system latency, energy, and available resource constraints. Moreover, we propose an efficient algorithm to tackle the considered highly non-convex problem. Finally, extensive simulations demonstrate the effectiveness of the proposed design. In particular, model parameter distortion is shown to provide a reliable bound on output distortion. Also, the proposed joint pruning ratio and resource management design achieves superior performance in balancing trade-offs among inference performance, system latency, and energy consumption compared with benchmark schemes, such as fully on-device and on-server inference. Moreover, the split point is shown to play a critical role in system performance optimization under heterogeneous and resource-limited edge environments.
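Read as stated in the abstract, the joint design minimizes a distortion bound over the pruning ratio, transmit power, and computation frequency under latency, energy, and resource constraints. The formulation below is a hedged sketch of that shape; the symbols and the exact constraint structure are assumptions for illustration, not the paper's verbatim problem.

```latex
% Assumed shape of the joint optimization (not the paper's exact formulation).
% rho: pruning ratio, p: transmit power, f_d / f_s: device / server CPU frequencies
\begin{align}
  \min_{\rho,\, p,\, f_{\mathrm{d}},\, f_{\mathrm{s}}} \quad & \bar{D}(\rho) \\
  \text{s.t.} \quad & T_{\mathrm{comp}}(\rho, f_{\mathrm{d}}, f_{\mathrm{s}}) + T_{\mathrm{tx}}(p) \le T_{\max}, \\
                    & E_{\mathrm{comp}}(\rho, f_{\mathrm{d}}) + E_{\mathrm{tx}}(p) \le E_{\max}, \\
                    & 0 \le \rho \le 1,\quad 0 \le p \le p_{\max},\quad 0 \le f_{\mathrm{d}} \le f_{\mathrm{d}}^{\max},\quad 0 \le f_{\mathrm{s}} \le f_{\mathrm{s}}^{\max}.
\end{align}
```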
Problem

Research questions and friction points this paper is trying to address.

Optimizing large AI model inference in wireless edge networks
Balancing pruning ratio and co-inference performance trade-offs
Jointly managing resources for latency and energy efficiency (see the latency and energy model sketch below)
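As background for the latency and energy point above, co-inference designs of this kind are usually analyzed with standard computation and transmission models; the expressions below are common textbook forms with hypothetical symbols, not equations taken from this paper.

```latex
% Typical (assumed) device-side computation and uplink transmission models.
% c: CPU cycles of the on-device sub-model, f: CPU frequency, kappa: chip constant
% B: bandwidth, p: transmit power, g: channel gain, sigma^2: noise power, d: feature size in bits
\begin{align}
  T_{\mathrm{comp}} &= \frac{c}{f}, \qquad E_{\mathrm{comp}} = \kappa\, c\, f^{2} \\
  r &= B \log_{2}\!\left(1 + \frac{p\, g}{\sigma^{2}}\right), \qquad
  T_{\mathrm{tx}} = \frac{d}{r}, \qquad E_{\mathrm{tx}} = p\, T_{\mathrm{tx}}
\end{align}
```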
Innovation

Methods, ideas, or system contributions that make the work stand out.

Pruning-aware LAIM co-inference scheme
Joint pruning ratio and resource optimization
Edge-device partitioned model deployment (see the code sketch after this list)
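The partitioned deployment in the last point can be pictured with a minimal sketch, assuming a PyTorch model, magnitude pruning, and a sequential split point; the module sizes, split point, and pruning routine here are illustrative choices, not the paper's implementation.

```python
# Minimal sketch of pruning-aware edge-device co-inference (illustrative only;
# the toy model, split point, and pruning routine are assumptions, not the
# paper's implementation).
import torch
import torch.nn as nn
import torch.nn.utils.prune as prune


def prune_model(model: nn.Module, pruning_ratio: float) -> nn.Module:
    """Apply unstructured L1-magnitude pruning to every Linear layer."""
    for module in model.modules():
        if isinstance(module, nn.Linear):
            prune.l1_unstructured(module, name="weight", amount=pruning_ratio)
            prune.remove(module, "weight")  # make the pruning permanent
    return model


def split_model(model: nn.Sequential, split_point: int):
    """Partition a sequential model into on-device and on-server sub-models."""
    device_part = model[:split_point]
    server_part = model[split_point:]
    return device_part, server_part


# --- toy LAIM stand-in and co-inference pipeline ---
model = nn.Sequential(*[nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(8)])
model = prune_model(model, pruning_ratio=0.3)              # pruning ratio chosen by the joint optimizer
on_device, on_server = split_model(model, split_point=3)   # split point chosen per channel/compute state

x = torch.randn(1, 256)            # input captured on the end device
activation = on_device(x)          # executed locally on the device
output = on_server(activation)     # executed remotely on the edge server
```

In a real deployment, the intermediate activation produced on the device would be transmitted over the wireless uplink before the server-side sub-model resumes execution, which is where the transmit-power and split-point choices enter the latency and energy budgets.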