CoFormer: Collaborating with Heterogeneous Edge Devices for Scalable Transformer Inference

📅 2025-08-27
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address the tension between resource-constrained edge devices and the high computational/memory overhead of Transformer models, this paper proposes CoFormer, a collaborative inference system. CoFormer is the first to exploit the structural decomposability and output integrability of Transformers, enabling model partitioning and distributed deployment across heterogeneous edge devices for cross-device collaborative inference. We design the Decompose-and-Boost (DeBo) optimization algorithm to derive optimal decomposition strategies, coupled with intermediate-result aggregation and progressive calibration to preserve accuracy while substantially reducing system overhead. Experiments demonstrate that, compared to baselines, CoFormer achieves 3.1× inference speedup, 76.3% memory reduction, and ~40% energy savings on GPT2-XL—marking the first demonstration of low-latency, high-accuracy, and energy-efficient large-model inference in edge computing scenarios.

Technology Category

Application Category

📝 Abstract
The impressive performance of transformer models has sparked the deployment of intelligent applications on resource-constrained edge devices. However, ensuring high-quality service for real-time edge systems is a significant challenge due to the considerable computational demands and resource requirements of these models. Existing strategies typically either offload transformer computations to other devices or directly deploy compressed models on individual edge devices. These strategies, however, result in either considerable communication overhead or suboptimal trade-offs between accuracy and efficiency. To tackle these challenges, we propose a collaborative inference system for general transformer models, termed CoFormer. The central idea behind CoFormer is to exploit the divisibility and integrability of transformer. An off-the-shelf large transformer can be decomposed into multiple smaller models for distributed inference, and their intermediate results are aggregated to generate the final output. We formulate an optimization problem to minimize both inference latency and accuracy degradation under heterogeneous hardware constraints. DeBo algorithm is proposed to first solve the optimization problem to derive the decomposition policy, and then progressively calibrate decomposed models to restore performance. We demonstrate the capability to support a wide range of transformer models on heterogeneous edge devices, achieving up to 3.1$ imes$ inference speedup with large transformer models. Notably, CoFormer enables the efficient inference of GPT2-XL with 1.6 billion parameters on edge devices, reducing memory requirements by 76.3%. CoFormer can also reduce energy consumption by approximately 40% while maintaining satisfactory inference performance.
Problem

Research questions and friction points this paper is trying to address.

Enables scalable transformer inference on resource-constrained edge devices
Minimizes latency and accuracy degradation under hardware constraints
Reduces communication overhead and energy consumption for transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes large transformers into smaller distributed models
Uses DeBo algorithm for optimization and calibration
Reduces latency and energy while maintaining accuracy
🔎 Similar Papers
No similar papers found.
G
Guanyu Xu
School of Information and Electrionics, Beijing Institute of Technology, Beijing 100081, China
Zhiwei Hao
Zhiwei Hao
Beijing Institute of Technology
computer visionefficient deep learning
L
Li Shen
School of Cyber Science and Technology, Shenzhen Campus of Sun Yat-sen University, Shenzhen 518107, China
Yong Luo
Yong Luo
Wuhan University
Artifical IntelligenceMachine LearningData MiningPattern Classification and Search
F
Fuhui Sun
Information Technology Service Center of People’s Court, Beijing, 100745, China
X
Xiaoyan Wang
Information Technology Service Center of People’s Court, Beijing, 100745, China
H
Han Hu
School of Information and Electrionics, Beijing Institute of Technology, Beijing 100081, China
Yonggang Wen
Yonggang Wen
FIEEE, FSAEng, Professor & President's Chair, Nanyang Technological University Singapore
Data CenterDigital TwinMultimedia ComputingGreen Computing