Streaming Tensor Program: A streaming abstraction for dynamic parallelism

📅 2025-11-11
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing spatial dataflow accelerators rely on programming abstractions ill-suited to dynamic tensor behaviors (dynamic shapes, ragged inputs, and data-dependent control flow), leaving limited visibility and flexibility for performance-critical decisions. This work introduces STeP, a streaming tensor programming abstraction tailored for spatial dataflow architectures. STeP enables fine-grained modeling and optimization of dynamic parallelism via symbolic shape semantics, flexible routing operators, and explicit memory hierarchy management. Its core optimizations are dynamic tiling, dynamic parallelization, and configuration time-multiplexing, all of which exploit explicitly exposed variations in data rates and tensor dimensions. Evaluated on real LLM layer traces, dynamic tiling reduces on-chip memory requirements by 2.18x, dynamic parallelization improves latency by 1.5x, and configuration time-multiplexing increases compute utilization by 2.57x.

📝 Abstract
Dynamic behaviors are becoming prevalent in many tensor applications. In machine learning, for example, input tensors are dynamically shaped or ragged, and data-dependent control flow is widely used in many models. However, the limited expressiveness of prior programming abstractions for spatial dataflow accelerators forces dynamic behaviors to be implemented statically or hides the visibility needed for performance-critical decisions. To address these challenges, we present the Streaming Tensor Program (STeP), a new streaming abstraction that enables dynamic tensor workloads to run efficiently on spatial dataflow accelerators. STeP introduces flexible routing operators, an explicit memory hierarchy, and symbolic shape semantics that expose dynamic data rates and tensor dimensions. These capabilities unlock new optimizations (dynamic tiling, dynamic parallelization, and configuration time-multiplexing) that adapt to dynamic behaviors while preserving dataflow efficiency. Using a cycle-approximate simulator on representative LLM layers with real-world traces, dynamic tiling reduces on-chip memory requirements by 2.18x, dynamic parallelization improves latency by 1.5x, and configuration time-multiplexing improves compute utilization by 2.57x over implementations available in prior abstractions.
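To make the dynamic-tiling idea concrete, here is a minimal Python sketch, not the STeP abstraction itself: the function names, the tile size, and the ragged sequence lengths are all illustrative assumptions. It contrasts static tiling, which pads every sequence in a ragged batch to the longest length (fixing the on-chip buffer footprint at the worst case), with dynamic tiling, where tile counts follow each sequence's actual runtime-resolved length, the way a symbolic shape would.

```python
# Hypothetical sketch (not the STeP API): tiling a ragged batch of sequences.

def static_tiles(seq_lens, tile):
    """Static tiling: pad every sequence to the max length, then tile.
    The tile count per sequence is fixed at the worst case, so short
    sequences stream mostly padding through the on-chip buffer."""
    max_len = max(seq_lens)
    padded = ((max_len + tile - 1) // tile) * tile  # round up to tile multiple
    return [padded // tile for _ in seq_lens]       # same count for everyone

def dynamic_tiles(seq_lens, tile):
    """Dynamic tiling: tile each sequence by its actual length, resolved
    at runtime, so buffer usage tracks the real data rather than the max."""
    return [(n + tile - 1) // tile for n in seq_lens]

lens = [7, 120, 33]  # illustrative ragged lengths, e.g. from an LLM trace
T = 32               # illustrative tile size
print(static_tiles(lens, T))   # every sequence pays for the 120-long one
print(dynamic_tiles(lens, T))  # short sequences need far fewer tiles
```

The gap between the two tile counts is the padding that static abstractions force on-chip, which is the kind of waste the paper's reported 2.18x memory reduction targets.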
Problem

Research questions and friction points this paper is trying to address.

Addressing limited expressiveness for dynamic tensor workloads
Enabling efficient execution on spatial dataflow accelerators
Optimizing memory usage and compute utilization dynamically
Innovation

Methods, ideas, or system contributions that make the work stand out.

Introduces flexible routing operators for dynamic data
Establishes explicit memory hierarchy for tensor workloads
Uses symbolic shape semantics for dynamic optimizations