🤖 AI Summary
This work addresses the challenge of deploying in-network collective (INC) acceleration in the open Ethernet ecosystem, where the cross-layer nature of INC has hindered investment and adoption. The authors propose EPIC (Ethernet Polymorphic In-network Collective), an INC protocol specification and reference system built on the principle of “unified abstraction, polymorphic realization.” EPIC defines an INC abstraction compatible with standard Ethernet and offers multiple realizations tailored to different hardware capabilities. Its modular design gives vendors an evolutionary path from simple to complex implementations, formal verification is applied to prove the correctness of the proposed polymorphic modes, and a unified resource management model covers diverse INC scenarios. Validation spanning model checking, packet/flow simulation, VM emulation, a Tofino testbed, and FPGA/RTL verification confirms EPIC's correctness, performance gains, and feasibility.
📝 Abstract
In-Network Collective (INC) acceleration holds immense potential for optimizing AI training and inference; however, its cross-layer nature has historically hindered investment and adoption within the open Ethernet ecosystem. To bridge this gap, we propose EPIC (Ethernet Polymorphic In-network Collective), an INC protocol specification and reference system built on the principle of "Unified Abstraction, Polymorphic Realization." EPIC introduces an abstraction compatible with standard Ethernet that aligns functional boundaries with participant roles, while offering polymorphic realizations tailored to varying hardware capabilities.
We address three fundamental challenges: first, we employ a modular design that enables an evolutionary path from simple to complex implementations, allowing vendors to iterate their hardware incrementally; second, we apply formal verification methodologies to prove the correctness of all proposed polymorphic modes; and third, we develop a unified resource management model versatile enough for diverse INC scenarios. Extensive validation (spanning model checking, packet/flow simulations, VM emulation, a Tofino testbed, and FPGA/RTL verification) confirms EPIC's correctness, performance gains, and feasibility.
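As a rough illustration of the "unified abstraction, polymorphic realization" idea, the sketch below shows one caller-facing interface with two interchangeable backends that must produce the same result. It is not taken from EPIC: the names `CollectiveBackend`, `HostReduce`, `SwitchReduce`, `run_collective`, and the `slots` parameter are hypothetical, and the "switch" is a plain in-memory simulation with a fixed number of aggregator slots.

```python
from abc import ABC, abstractmethod
from typing import List


class CollectiveBackend(ABC):
    """Hypothetical unified interface: every realization exposes the same call."""

    @abstractmethod
    def allreduce_sum(self, contributions: List[List[int]]) -> List[int]:
        """Return the element-wise sum of all participants' vectors."""


class HostReduce(CollectiveBackend):
    """Assumed fallback mode: endpoints reduce, the network only forwards."""

    def allreduce_sum(self, contributions: List[List[int]]) -> List[int]:
        return [sum(column) for column in zip(*contributions)]


class SwitchReduce(CollectiveBackend):
    """Assumed in-network mode: a simulated switch with limited aggregator slots."""

    def __init__(self, slots: int) -> None:
        if slots < 1:
            raise ValueError("slots must be positive")
        self.slots = slots

    def allreduce_sum(self, contributions: List[List[int]]) -> List[int]:
        length = len(contributions[0])
        result: List[int] = []
        # Process the vector in chunks that fit the available aggregator slots.
        for start in range(0, length, self.slots):
            end = min(start + self.slots, length)
            aggregators = [0] * (end - start)
            for vector in contributions:
                for i, value in enumerate(vector[start:end]):
                    aggregators[i] += value
            result.extend(aggregators)
        return result


def run_collective(backend: CollectiveBackend,
                   contributions: List[List[int]]) -> List[int]:
    """Caller-side entry point; it does not know which realization is in use."""
    if not contributions:
        raise ValueError("at least one participant is required")
    if any(len(v) != len(contributions[0]) for v in contributions):
        raise ValueError("all participants must contribute equal-length vectors")
    return backend.allreduce_sum(contributions)


if __name__ == "__main__":
    data = [[1, 2, 3, 4, 5], [10, 20, 30, 40, 50], [100, 200, 300, 400, 500]]
    print(run_collective(HostReduce(), data))
    print(run_collective(SwitchReduce(slots=2), data))
```

Both backends are expected to agree on every input. That kind of equivalence across realizations is what the abstract says EPIC establishes through formal verification; this toy only compares outputs on sample data.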