🤖 AI Summary
This work addresses the challenge of deploying large language models (LLMs) for intelligent Earth observation on low-Earth-orbit satellites, which are constrained by limited onboard memory and high inference latency. To overcome these limitations, the authors propose a multi-satellite collaborative inference framework that partitions an LLM into sub-models distributed across satellites, combining pipeline parallelism with an adaptive compression mechanism for intermediate activations to reduce communication overhead and latency. The inference-latency minimization problem is formulated as a shortest-path problem on a weighted directed acyclic graph, and an enhanced A* algorithm is developed to solve it efficiently. Experimental results show that, compared to existing approaches, the proposed method reduces inference latency by up to 42%, cuts communication overhead by up to 71%, and incurs less than 1% accuracy degradation.
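The pipeline-parallelism idea in the summary can be illustrated with a back-of-the-envelope latency model. The sketch below is not from the paper; the stage times and request count are hypothetical. It treats each sub-model's compute and each inter-satellite transmission as a pipeline stage: once the pipeline is full, the slowest stage paces every subsequent item, whereas a non-overlapped scheme pays the full per-item cost every time.

```python
def sequential_latency(stage_times, m):
    """Latency for m items when stages never overlap: each item
    traverses every stage before the next item starts."""
    return m * sum(stage_times)

def pipelined_latency(stage_times, m):
    """Latency for m items when stages overlap: one pipeline fill,
    then the bottleneck stage emits one item per max(stage_times)."""
    return sum(stage_times) + (m - 1) * max(stage_times)

# Hypothetical stages (ms): sub-model A compute, activation
# transmission over the inter-satellite link, sub-model B compute.
stages = [4.0, 2.0, 3.0]
print(sequential_latency(stages, 5))  # 45.0
print(pipelined_latency(stages, 5))   # 25.0
```

The gap widens as more requests (or tokens) flow through, which is why overlapping sub-model inference with activation transmission matters in this setting.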
📝 Abstract
Low Earth orbit (LEO) satellites play an essential role in intelligent Earth observation by leveraging artificial intelligence models. However, limited onboard memory and excessive inference delay prevent the practical deployment of large language models (LLMs) on a single satellite. In this paper, we propose a communication-efficient collaborative LLM inference scheme for LEO satellite networks. Specifically, the entire LLM is split into multiple sub-models, each deployed on a separate satellite, thereby enabling collaborative LLM inference via the exchange of intermediate activations between satellites. The proposed scheme also leverages a pipeline parallelism mechanism that overlaps sub-model inference with intermediate activation transmission, thereby reducing LLM inference delay. An adaptive activation compression scheme is designed to mitigate cumulative errors from multi-stage model splitting while preserving inference accuracy. Furthermore, we formulate the LLM inference delay minimization problem by jointly optimizing model splitting and compression ratios under onboard memory and inference accuracy constraints. The problem is transformed into a shortest-path search over a directed acyclic graph whose edge weights explicitly quantify the inference delay induced by the model splitting and compression strategies, and is solved via a modified A*-based search algorithm. Extensive simulation results indicate that the proposed solution can reduce inference delay by up to 42% and communication overhead by up to 71% compared to state-of-the-art benchmarks, while keeping the inference accuracy loss below 1%.
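The joint split-and-compress formulation described in the abstract can be sketched as a shortest-path search. The toy below is a minimal illustration under invented numbers: nodes are layer boundaries, an edge assigns a contiguous block of layers to one satellite (subject to a memory budget) and picks a compression ratio for the activation sent at the cut, the edge weight is compute-plus-transmission delay, and an A*-style search with an admissible remaining-compute heuristic finds the minimum-delay plan. All constants, the accuracy-loss proxy, and the heuristic are assumptions for illustration, not the paper's actual model.

```python
import heapq

# Hypothetical per-layer costs (illustrative, not from the paper).
LAYER_COMPUTE = [4.0, 3.0, 5.0, 2.0, 4.0, 3.0]   # ms of compute per layer
LAYER_MEMORY  = [2.0, 2.0, 3.0, 1.0, 2.0, 2.0]   # GB of weights per layer
ACT_SIZE      = 8.0    # MB of uncompressed activation at any cut point
LINK_RATE     = 2.0    # MB/ms inter-satellite link rate
MEM_BUDGET    = 6.0    # GB of onboard memory per satellite
COMP_RATIOS   = [1.0, 0.5, 0.25]                  # candidate compression ratios
ERR_PER_CUT   = {1.0: 0.0, 0.5: 0.2, 0.25: 0.5}  # accuracy-loss proxy per cut
ERR_BUDGET    = 1.0    # total accuracy-loss budget across all cuts

N = len(LAYER_COMPUTE)

def heuristic(i):
    # Admissible lower bound: remaining layers still cost their compute
    # even if transmission were free, so A* never overestimates.
    return sum(LAYER_COMPUTE[i:])

def shortest_split():
    """A*-style search over the split DAG.

    State = (next layer to place, accuracy-loss budget used so far).
    An edge places layers i..j-1 on the next satellite with ratio r.
    Returns (total delay, list of (start, end, ratio) segments).
    """
    start = (0, 0.0)
    best = {start: 0.0}
    pq = [(heuristic(0), 0.0, start, [])]
    while pq:
        _, g, (i, err), plan = heapq.heappop(pq)
        if i == N:
            return g, plan
        mem = comp = 0.0
        for j in range(i + 1, N + 1):
            mem += LAYER_MEMORY[j - 1]
            comp += LAYER_COMPUTE[j - 1]
            if mem > MEM_BUDGET:          # block no longer fits on one satellite
                break
            # The final block sends no activation, so no compression choice.
            for r in (COMP_RATIOS if j < N else [1.0]):
                tx = (ACT_SIZE * r / LINK_RATE) if j < N else 0.0
                e2 = err + (ERR_PER_CUT[r] if j < N else 0.0)
                if e2 > ERR_BUDGET:       # accuracy constraint violated
                    continue
                state, g2 = (j, e2), g + comp + tx
                if g2 < best.get(state, float("inf")):
                    best[state] = g2
                    heapq.heappush(pq, (g2 + heuristic(j), g2, state,
                                        plan + [(i, j, r)]))
    return None

delay, plan = shortest_split()
print(delay, plan)
```

The returned plan lists which layers each satellite hosts and the compression ratio applied at each cut; the real scheme additionally accounts for pipeline overlap between compute and transmission, which this toy omits for brevity.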