StreamDCIM: A Tile-based Streaming Digital CIM Accelerator with Mixed-stationary Cross-forwarding Dataflow for Multimodal Transformer

📅 2025-02-09
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing digital computing-in-memory (CIM) accelerators suffer from limited microarchitectural flexibility, inflexible dataflows, and coarse-grained pipelining, which hinders efficient execution of multimodal Transformer models. To address this, the authors propose StreamDCIM, a tile-based streaming digital CIM accelerator. Its key contributions are: (1) a tile-based reconfigurable CIM macro microarchitecture with normal and hybrid reconfigurable modes, which improves intra-macro CIM utilization; (2) a mixed-stationary cross-forwarding dataflow with tile-based execution decoupling, which exploits tile-level computation parallelism; and (3) a ping-pong-like fine-grained compute-rewriting pipeline that overlaps high-latency on-chip CIM rewriting with computation. Evaluated on typical multimodal Transformer models, the design achieves geometric mean speedups of 2.63× and 1.28× over non-streaming and layer-based streaming CIM baselines, respectively, while significantly improving energy efficiency and throughput.

📝 Abstract
Multimodal Transformers are emerging artificial intelligence (AI) models designed to process a mixture of signals from diverse modalities. Digital computing-in-memory (CIM) architectures are considered promising for achieving high efficiency while maintaining high accuracy. However, current digital CIM-based accelerators are too inflexible in microarchitecture, dataflow, and pipeline to accelerate multimodal Transformers effectively. In this paper, we propose StreamDCIM, a tile-based streaming digital CIM accelerator for multimodal Transformers. It overcomes the above challenges with three features. First, we present a tile-based reconfigurable CIM macro microarchitecture with normal and hybrid reconfigurable modes to improve intra-macro CIM utilization. Second, we implement a mixed-stationary cross-forwarding dataflow with tile-based execution decoupling to exploit tile-level computation parallelism. Third, we introduce a ping-pong-like fine-grained compute-rewriting pipeline to overlap high-latency on-chip CIM rewriting. Experimental results show that StreamDCIM outperforms non-streaming and layer-based streaming CIM-based solutions by a geometric mean of 2.63× and 1.28×, respectively, on typical multimodal Transformer models.
Problem

Research questions and friction points this paper is trying to address.

StreamDCIM accelerates multimodal Transformers
Improves CIM microarchitecture flexibility
Enhances computation parallelism with mixed-stationary dataflow
Innovation

Methods, ideas, or system contributions that make the work stand out.

Tile-based reconfigurable CIM macro
Mixed-stationary cross-forwarding dataflow
Ping-pong-like compute-rewriting pipeline
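The idea behind a ping-pong-like compute-rewriting pipeline can be illustrated with a toy latency model. This is a minimal sketch under stated assumptions, not the StreamDCIM implementation: it assumes two CIM banks that alternate roles, a fixed per-tile rewrite cost and compute cost, and no other stalls. The function names and all cycle counts are hypothetical and do not come from the paper.

```python
def serial_latency(tiles, rewrite_cycles, compute_cycles):
    """Baseline: each tile is rewritten, then computed, with no overlap."""
    return tiles * (rewrite_cycles + compute_cycles)


def ping_pong_latency(tiles, rewrite_cycles, compute_cycles):
    """Two banks alternate: one is rewritten while the other computes.

    Assumption (hypothetical model): the first rewrite and the last
    compute cannot be hidden; every step in between is bounded by the
    slower of the two stages.
    """
    if tiles == 0:
        return 0
    steady_state = (tiles - 1) * max(rewrite_cycles, compute_cycles)
    return rewrite_cycles + steady_state + compute_cycles


if __name__ == "__main__":
    # Hypothetical numbers chosen only for illustration.
    tiles, rewrite, compute = 8, 4, 6
    print("serial:   ", serial_latency(tiles, rewrite, compute))
    print("ping-pong:", ping_pong_latency(tiles, rewrite, compute))
```

In this model the benefit grows with the number of tiles, and the rewrite cost is fully hidden in steady state only when it does not exceed the compute cost per tile.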
Shantian Qin
Institute of Computing Technology, Chinese Academy of Sciences
AI Chip, Reconfigurable Computing, Computing-in-Memory, Computer Architecture
Ziqing Qiang
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
Zhihua Fan
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
Wenming Li
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
Xuejun An
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
Xiaochun Ye
State Key Lab of Processors, Institute of Computing Technology, Chinese Academy of Sciences, Beijing, China; School of Computer Science and Technology, University of Chinese Academy of Sciences, Beijing, China
Dongrui Fan
Institute of Computing Technology, Chinese Academy of Sciences
Computer Architecture, Processor Design, Many-core Design