🤖 AI Summary
To address the high latency and energy consumption of training tensorized neural networks (TNNs) on resource-constrained devices—costs driven by tensor contraction and dynamic tensor reshaping—this paper proposes FETTA, a hardware-software co-designed tensor training acceleration framework. Methodologically, it introduces a contraction sequence search engine (CSSE) that performs graph-optimization-driven automated contraction path discovery, and it designs a reconfigurable contraction engine (CE) array coupled with butterfly-based distribution and reduction networks to enable low-overhead online tensor shape transformation. Experimental results demonstrate that, compared to GPU and TPU baselines, the framework achieves 20.5× and 100.9× latency reduction, 567.5× and 45.03× energy reduction, and 11,609.7× and 4,544.8× improvement in energy-delay product, respectively. Against prior tensor accelerators, it delivers a 3.87–14.63× speedup and a 1.41–2.73× energy efficiency improvement.
📝 Abstract
The increasing demand for on-device training of deep neural networks (DNNs) aims to leverage personal data for high-performance applications while addressing privacy concerns and reducing communication latency. However, resource-constrained platforms face significant challenges due to the intensive computational and memory demands of DNN training. Tensor decomposition emerges as a promising approach to compress model size without sacrificing accuracy. Nevertheless, training tensorized neural networks (TNNs) incurs non-trivial overhead and severe performance degradation on conventional accelerators due to complex tensor shaping requirements. To address these challenges, we propose FETTA, an algorithm and hardware co-optimization framework for efficient TNN training. On the algorithm side, we develop a contraction sequence search engine (CSSE) to identify the optimal contraction sequence with minimal computational overhead. On the hardware side, FETTA features a flexible and efficient architecture equipped with a reconfigurable contraction engine (CE) array to support diverse dataflows. Furthermore, butterfly-based distribution and reduction networks are implemented to perform flexible tensor shaping operations during computation. Evaluation results demonstrate that FETTA achieves reductions of 20.5x/100.9x, 567.5x/45.03x, and 11609.7x/4544.8x in terms of processing latency, energy, and energy-delay product (EDP) over GPU and TPU, respectively. Moreover, on tensorized training workloads, FETTA outperforms prior accelerators with a speedup of 3.87–14.63x and an energy efficiency improvement of 1.41–2.73x on average.
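To see why a contraction sequence search matters, consider a simple chain of tensor contractions: the order in which pairwise contractions are performed can change the FLOP count by orders of magnitude. The sketch below uses NumPy's `einsum_path` as a stand-in illustration of automated contraction path discovery; it is not the paper's CSSE (which is graph-optimization-driven and targets TNN training graphs), just a minimal analogue of the same idea.

```python
import numpy as np

# Toy chain contraction D_il = A_ij B_jk C_kl, with shapes chosen so
# that the contraction order strongly affects the FLOP count:
# contracting (A·B) first costs ~4*100*100 FLOPs per step, while
# contracting (B·C) first costs ~100*100*100 for the first step alone.
rng = np.random.default_rng(0)
A = rng.random((4, 100))
B = rng.random((100, 100))
C = rng.random((100, 100))

# einsum_path searches over pairwise contraction orders and returns the
# chosen path plus a cost report comparing naive vs. optimized FLOPs.
# This mimics, in miniature, what a contraction sequence search does.
path, info = np.einsum_path('ij,jk,kl->il', A, B, C, optimize='optimal')
print(path)  # list of pairwise contraction steps
print(info)  # human-readable FLOP-count comparison

# The result is identical for any valid path; only the cost differs.
D = np.einsum('ij,jk,kl->il', A, B, C, optimize=path)
```

Here the cheap path contracts the small tensor `A` into the chain first, which is exactly the kind of saving an automated search engine locks in before training begins.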