Scaling Data Center TCP to Terabits with Laminar

📅 2025-04-27
🤖 AI Summary
Problem: Delivering high throughput, low latency, and low energy consumption at terabit speeds while staying compatible with standard TCP remains challenging. Method: This paper introduces Laminar, the first TCP stack designed for the reconfigurable match-action table (RMT) architecture used in high-speed programmable switches and SmartNICs. Laminar restructures TCP processing into a pipeline of simple match-action operations and relies on three techniques (optimistic concurrency, pseudo segment updates, and bump-in-the-wire processing) to handle retransmission, reassembly, flow control, and congestion control entirely within the RMT pipeline. Prototyped on an Intel Tofino2 switch, Laminar stays compatible with POSIX sockets and is extended with a sequencer API for a linearizable distributed shared log, a new congestion control protocol, and delayed ACKs. Results: Laminar reaches 25M pkts/sec with a single host core on streaming workloads, enough to exceed 1.6 Tbps with an 8K MTU. Versus the TAS kernel-bypass TCP stack on short RPC workloads, it saves up to 16 host CPU cores while achieving 1.3× higher peak throughput at 5× lower 99.99p tail latency, and a key-value store on Laminar doubles throughput-per-watt.

📝 Abstract
Laminar is the first TCP stack designed for the reconfigurable match-action table (RMT) architecture, widely used in high-speed programmable switches and SmartNICs. Laminar reimagines TCP processing as a pipeline of simple match-action operations, enabling line-rate performance with low latency and minimal energy consumption, while maintaining compatibility with standard TCP and POSIX sockets. Leveraging novel techniques like optimistic concurrency, pseudo segment updates, and bump-in-the-wire processing, Laminar handles the transport logic, including retransmission, reassembly, flow, and congestion control, entirely within the RMT pipeline. We prototype Laminar on an Intel Tofino2 switch and demonstrate its scalability to terabit speeds, its flexibility, and robustness to network dynamics. Laminar reaches an unprecedented 25M pkts/sec with a single host core for streaming workloads, enough to exceed 1.6 Tbps with 8K MTU. Laminar delivers RDMA-equivalent performance, saving up to 16 host CPU cores versus the TAS kernel-bypass TCP stack with short RPC workloads, while achieving 1.3$\times$ higher peak throughput at 5$\times$ lower 99.99p tail latency. A key-value store on Laminar doubles the throughput-per-watt versus TAS. Demonstrating Laminar's flexibility, we implement TCP stack extensions, including a sequencer API for a linearizable distributed shared log, a new congestion control protocol, and delayed ACKs.
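
The packet-rate figure in the abstract can be checked with a quick back-of-the-envelope calculation. The sketch below assumes that "8K MTU" means 8192 bytes and that every packet is MTU-sized; it ignores header overhead, and the function name `throughput_tbps` is a hypothetical helper, not something from the paper.

```python
# Back-of-the-envelope check: packet rate -> aggregate bandwidth.
PACKETS_PER_SECOND = 25_000_000  # 25M pkts/sec, as stated in the abstract
MTU_BYTES = 8 * 1024             # assumption: "8K MTU" means 8192 bytes
BITS_PER_BYTE = 8


def throughput_tbps(packets_per_second: int, packet_bytes: int) -> float:
    """Return aggregate throughput in terabits per second (1 Tbps = 1e12 bit/s)."""
    return packets_per_second * packet_bytes * BITS_PER_BYTE / 1e12


if __name__ == "__main__":
    print(f"{throughput_tbps(PACKETS_PER_SECOND, MTU_BYTES):.4f} Tbps")  # 1.6384 Tbps
```

Under these assumptions the result is 1.6384 Tbps, consistent with the "exceed 1.6 Tbps" claim. With a decimal 8000-byte packet the same rate would give exactly 1.6 Tbps.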
Problem

Research questions and friction points this paper is trying to address.

Designing TCP for high-speed programmable switches
Achieving line-rate performance with low energy
Scaling TCP to terabit speeds efficiently
Innovation

Methods, ideas, or system contributions that make the work stand out.

TCP stack for RMT architecture in switches
Pipeline of match-action operations for efficiency
Optimistic concurrency and pseudo segment updates