CoFormer: Collaborating with Heterogeneous Edge Devices for Scalable Transformer Inference

📅 2025-08-27

📈 Citations: 0

✨ Influential: 0

career value

194K/year

🤖 AI Summary

To address the tension between resource-constrained edge devices and the high computational/memory overhead of Transformer models, this paper proposes CoFormer, a collaborative inference system. CoFormer is the first to exploit the structural decomposability and output integrability of Transformers, enabling model partitioning and distributed deployment across heterogeneous edge devices for cross-device collaborative inference. We design the Decompose-and-Boost (DeBo) optimization algorithm to derive optimal decomposition strategies, coupled with intermediate-result aggregation and progressive calibration to preserve accuracy while substantially reducing system overhead. Experiments demonstrate that, compared to baselines, CoFormer achieves 3.1× inference speedup, 76.3% memory reduction, and ~40% energy savings on GPT2-XL—marking the first demonstration of low-latency, high-accuracy, and energy-efficient large-model inference in edge computing scenarios.

Technology Category

Application Category

📝 Abstract

The impressive performance of transformer models has sparked the deployment of intelligent applications on resource-constrained edge devices. However, ensuring high-quality service for real-time edge systems is a significant challenge due to the considerable computational demands and resource requirements of these models. Existing strategies typically either offload transformer computations to other devices or directly deploy compressed models on individual edge devices. These strategies, however, result in either considerable communication overhead or suboptimal trade-offs between accuracy and efficiency. To tackle these challenges, we propose a collaborative inference system for general transformer models, termed CoFormer. The central idea behind CoFormer is to exploit the divisibility and integrability of transformer. An off-the-shelf large transformer can be decomposed into multiple smaller models for distributed inference, and their intermediate results are aggregated to generate the final output. We formulate an optimization problem to minimize both inference latency and accuracy degradation under heterogeneous hardware constraints. DeBo algorithm is proposed to first solve the optimization problem to derive the decomposition policy, and then progressively calibrate decomposed models to restore performance. We demonstrate the capability to support a wide range of transformer models on heterogeneous edge devices, achieving up to 3.1$ imes$ inference speedup with large transformer models. Notably, CoFormer enables the efficient inference of GPT2-XL with 1.6 billion parameters on edge devices, reducing memory requirements by 76.3%. CoFormer can also reduce energy consumption by approximately 40% while maintaining satisfactory inference performance.

Problem

Research questions and friction points this paper is trying to address.

Enables scalable transformer inference on resource-constrained edge devices

Minimizes latency and accuracy degradation under hardware constraints

Reduces communication overhead and energy consumption for transformers

Innovation

Methods, ideas, or system contributions that make the work stand out.

Decomposes large transformers into smaller distributed models

Uses DeBo algorithm for optimization and calibration

Reduces latency and energy while maintaining accuracy

🔎 Similar Papers

No similar papers found.