StreamTensor: Make Tensors Stream in Dataflow Accelerators for LLMs

📅 2025-09-17
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
To address memory bottlenecks, inter-kernel dependency modeling challenges, and inefficient off-chip memory accesses in large language model (LLM) inference on dataflow accelerators, this paper proposes a compilation optimization methodology based on an iterative tensor type system. It explicitly encodes streaming tensor layouts to enable cross-kernel fusion, coordinated buffer optimization, and fine-grained memory management. The method systematically searches for an optimized implementation within a three-level design space comprising tensor tiling, kernel fusion strategies, and resource allocation. Implemented as an FPGA compiler framework, it achieves up to 24% lower latency than state-of-the-art FPGA LLM accelerators and up to 36% lower latency than GPUs on mainstream LLM workloads, and delivers up to 1.99× higher energy efficiency compared to GPUs. The approach significantly improves execution efficiency and hardware adaptability for large-scale tensor computations on dataflow architectures.
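The core idea in the summary, a tensor type that explicitly encodes its stream layout so the compiler can decide when two kernels may be fused stream-to-stream, can be illustrated with a minimal sketch. All names below (`StreamType`, `can_fuse`) are hypothetical stand-ins, not the paper's actual type system or API:

```python
from dataclasses import dataclass

# Hypothetical sketch: a tensor type that records not just the logical
# shape but how tiled elements arrive on a stream (tile sizes and the
# iteration order of tiles). This is the kind of information an
# iterative tensor type system would make explicit.
@dataclass(frozen=True)
class StreamType:
    shape: tuple  # logical tensor shape
    tile: tuple   # tile size per dimension
    order: tuple  # iteration order of tiles on the stream

def can_fuse(producer: StreamType, consumer: StreamType) -> bool:
    """Two kernels can be fused stream-to-stream (no off-chip round
    trip) only when the producer emits tiles in exactly the layout the
    consumer expects; otherwise a reorder buffer is required."""
    return producer == consumer

a = StreamType(shape=(128, 128), tile=(16, 16), order=(0, 1))
b = StreamType(shape=(128, 128), tile=(16, 16), order=(1, 0))
print(can_fuse(a, a))  # True: identical layouts, direct streaming
print(can_fuse(a, b))  # False: tile order differs, buffer needed
```

Making the layout part of the type lets a fusion pass reject illegal stream connections statically, instead of discovering a mismatch at runtime.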

📝 Abstract
Efficient execution of deep learning workloads on dataflow architectures is crucial for overcoming memory bottlenecks and maximizing performance. While streaming intermediate results between computation kernels can significantly improve efficiency, existing approaches struggle with inter-kernel correlations, external memory access management, and buffer optimization. In this work, we propose StreamTensor, a compiler framework that automatically constructs and optimizes stream-based dataflow accelerators. StreamTensor introduces a novel iterative tensor type system to explicitly encode stream layouts, enabling seamless kernel fusion, buffer allocation, and memory optimization. By systematically exploring three hierarchical design spaces, including tensor tiling, kernel fusion, and resource allocation, StreamTensor balances computational intensity, memory efficiency, and data streaming to maximize performance. Based on FPGA evaluations on Large Language Models (LLMs), StreamTensor achieves as low as 0.76x and 0.64x the latency of state-of-the-art FPGA LLM accelerators and GPUs, respectively, and up to 1.99x higher energy efficiency compared to GPUs, making it a promising approach for scalable dataflow-based deep learning acceleration.
Problem

Research questions and friction points this paper is trying to address.

Optimizes tensor streaming in dataflow accelerators for LLMs
Addresses inter-kernel correlations and memory access bottlenecks
Enables efficient kernel fusion and buffer allocation optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Compiler framework for stream-based dataflow accelerators
Novel iterative tensor type system for layouts
Hierarchical design space exploration for optimization
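The hierarchical exploration listed above, jointly choosing tile sizes, a fusion strategy, and a resource allocation, can be sketched as a small search. This is an illustrative toy, not the paper's algorithm: the cost model and constants below are invented stand-ins for the hardware-aware estimates a real compiler would use:

```python
from itertools import product

# Illustrative three-level design space (invented values):
TILES = [16, 32, 64]   # tensor tiling candidates
FUSE = [False, True]   # fuse kernels stream-to-stream, or spill to DRAM
UNITS = [1, 2, 4]      # compute units allocated to the kernel
BUDGET = 4             # total compute units available

def cost(tile, fuse, units):
    """Toy cost model: compute time shrinks with bigger tiles and more
    units; fusion eliminates the off-chip traffic term entirely."""
    compute = (1024 // tile) ** 2 / units
    traffic = 0 if fuse else 1024 * 1024 / tile
    return compute + traffic

# Exhaustively pick the cheapest point that fits the resource budget.
best = min(
    (c for c in product(TILES, FUSE, UNITS) if c[2] <= BUDGET),
    key=lambda c: cost(*c),
)
print(best)  # (64, True, 4): largest tiles, fused, full allocation
```

Even this toy search shows why the levels interact: fusion changes which tile sizes pay off, and resource limits cap how far either can be pushed, which is why the framework explores them jointly rather than one at a time.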