EPIC: Abstraction and Polymorphism of In-Network Collectives on Ethernet

📅 2026-05-18
🤖 AI Summary
This work addresses the difficulty of deploying in-network collective (INC) communication in the open Ethernet ecosystem, where the cross-layer nature of INC has hindered investment and adoption. The authors propose EPIC (Ethernet Polymorphic In-network Collective), an INC protocol specification and reference system built on the principle of "unified abstraction, polymorphic realization." EPIC defines an abstraction compatible with standard Ethernet whose functional boundaries follow participant roles, and offers multiple realizations matched to different hardware capabilities. Its modular design gives vendors an incremental path from simple to complex implementations, formal verification is used to prove the correctness of every proposed polymorphic mode, and a unified resource management model covers diverse INC scenarios. Validation by model checking, packet- and flow-level simulation, VM emulation, a Tofino testbed, and FPGA/RTL verification confirms EPIC's correctness, performance gain, and feasibility.
📝 Abstract
In-Network Collective (INC) acceleration holds immense potential for optimizing AI training and inference; however, its cross-layer nature has historically hindered investment and adoption within the open Ethernet ecosystem. To bridge this gap, we propose EPIC (Ethernet Polymorphic In-network Collective), an INC protocol specification and reference system built on the principle of "Unified Abstraction, Polymorphic Realization." EPIC introduces an abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles, while offering polymorphic realizations tailored to varying hardware capabilities. We address three fundamental challenges: first, we employ a modular design that enables an evolutionary path from simple to complex implementations, allowing vendors to iterate their hardware incrementally; second, we apply formal verification methodologies to prove the correctness of all proposed polymorphic modes; and third, we develop a unified resource management model versatile enough for diverse INC scenarios. Extensive validation -- spanning model checking, packet/flow simulations, VM emulation, Tofino Testbed, and FPGA/RTL verification -- confirms EPIC's correctness, performance gain, and feasibility.
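To make the "Unified Abstraction, Polymorphic Realization" principle concrete, here is a minimal Python sketch of the general idea. It is not EPIC's actual interface or protocol: the names `CollectiveBackend`, `HostOnlyBackend`, `SwitchSlotBackend`, `slot_size`, and `run_job` are all hypothetical, the "switch" is simulated in software, and inputs are assumed to be non-empty integer vectors of equal length. The sketch only shows that callers can depend on one abstraction while realizations differ in capability, and that the realizations can be checked against each other for identical results.

```python
from abc import ABC, abstractmethod
from typing import List, Sequence

Vector = List[int]


class CollectiveBackend(ABC):
    """Hypothetical unified abstraction: one call, many realizations."""

    @abstractmethod
    def allreduce_sum(self, contributions: Sequence[Vector]) -> List[Vector]:
        """Return the element-wise sum, delivered once per participant."""


class HostOnlyBackend(CollectiveBackend):
    """Hypothetical realization with no in-network support: endpoints reduce."""

    def allreduce_sum(self, contributions: Sequence[Vector]) -> List[Vector]:
        total = [sum(column) for column in zip(*contributions)]
        return [list(total) for _ in contributions]


class SwitchSlotBackend(CollectiveBackend):
    """Hypothetical realization mimicking a switch with a small aggregation buffer."""

    def __init__(self, slot_size: int) -> None:
        if slot_size < 1:
            raise ValueError("slot_size must be positive")
        self.slot_size = slot_size

    def allreduce_sum(self, contributions: Sequence[Vector]) -> List[Vector]:
        length = len(contributions[0])
        total: Vector = []
        # Aggregate one slot-sized chunk at a time, as bounded memory would force.
        for start in range(0, length, self.slot_size):
            slot = [0] * min(self.slot_size, length - start)
            for vec in contributions:
                for i in range(len(slot)):
                    slot[i] += vec[start + i]
            total.extend(slot)
        return [list(total) for _ in contributions]


def run_job(backend: CollectiveBackend, grads: Sequence[Vector]) -> List[Vector]:
    """Caller code depends only on the abstraction, not on the realization."""
    return backend.allreduce_sum(grads)


grads = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
host_result = run_job(HostOnlyBackend(), grads)
switch_result = run_job(SwitchSlotBackend(slot_size=2), grads)

# Both realizations must agree; this is a spot check, not formal verification.
assert host_result == switch_result
```

The equality check at the end is only a toy stand-in for the paper's stated goal of proving every polymorphic mode correct; the abstract describes formal verification and model checking, which this sketch does not attempt.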
Problem

Research questions and friction points this paper is trying to address.

In-Network Collective
Ethernet
Abstraction
Polymorphism
AI acceleration
Innovation

Methods, ideas, or system contributions that make the work stand out.

In-Network Collective
Ethernet abstraction
Polymorphic realization
Formal verification
Resource management

Yitao Yuan
PKU
Jianglong Nie
PKU
Tianyu Bai
PKU
Ruizhe Zhou
Northwestern Polytechnical University
UAV communications, Navigation and positioning
Siyuan Cao
PKU
Xujie Fan
PKU
Yuchen Xu
PKU
Junkai Chen
PKU
Chenqi Zhao
PKU
Nengyuan Zhang
PKU
Shaoke Fang
PKU
Jiangyuan Chen
USTC
Yuanfeng Chen
NUDT
Jiaqi Sun
Carnegie Mellon University
Causality, graph representation learning
Zhan Wang
ICT, CAS
Xiaohua Xu
USTC
Yuchao Zhang
Beijing University of Posts and Telecom
Yang Liu
Professor of Beijing University of Posts and Telecommunications
AI Chips and Networks
Xiangrui Yang
NUDT
Jing Lin
Infrawaves
Xiaohe Hu
Tsinghua University
machine learning, system and architecture
Yang Li
Lenovo Research
Chao Jiang
Lenovo Research
Limin Xiao
FDU
Fiber Optics, Optoelectronics
Weifeng Zhang
Corp VP & Head of Intelligent Computing Lab at Lenovo Research
AI HW SW Co-Design, Computer Architecture, Heterogeneous Computing, AI/ML, GPU Optimizations