🤖 AI Summary
To address the high latency and energy consumption of training tensorized neural networks (TNNs) on resource-constrained devices—costs driven by tensor contraction and dynamic tensor reshaping—this paper proposes FETTA, a hardware-software co-designed tensor training acceleration framework. Methodologically, it introduces a contraction sequence search engine (CSSE) that performs graph-optimization-driven automated contraction path discovery, and it designs a reconfigurable contraction engine (CE) array coupled with butterfly-based distribution and reduction networks to enable low-overhead online tensor shape transformation. Experimental results demonstrate that, compared to GPU and TPU baselines, the framework achieves 20.5× and 100.9× latency reduction, 567.5× and 45.03× energy reduction, and 11,609.7× and 4,544.8× improvement in energy-delay product, respectively. Against prior tensor accelerators, it delivers a 3.87–14.63× speedup and a 1.41–2.73× energy efficiency improvement.
📝 Abstract
The increasing demand for on-device training of deep neural networks (DNNs) aims to leverage personal data for high-performance applications while addressing privacy concerns and reducing communication latency. However, resource-constrained platforms face significant challenges due to the intensive computational and memory demands of DNN training. Tensor decomposition emerges as a promising approach to compress model size without sacrificing accuracy. Nevertheless, training tensorized neural networks (TNNs) incurs non-trivial overhead and severe performance degradation on conventional accelerators due to complex tensor shaping requirements. To address these challenges, we propose FETTA, an algorithm and hardware co-optimization framework for efficient TNN training. On the algorithm side, we develop a contraction sequence search engine (CSSE) to identify the optimal contraction sequence with minimal computational overhead. On the hardware side, FETTA features a flexible and efficient architecture equipped with a reconfigurable contraction engine (CE) array to support diverse dataflows. Furthermore, butterfly-based distribution and reduction networks are implemented to perform flexible tensor shaping operations during computation. Evaluation results demonstrate that FETTA achieves reductions of 20.5x/100.9x, 567.5x/45.03x, and 11609.7x/4544.8x in terms of processing latency, energy, and energy-delay product (EDP) over GPU and TPU, respectively. Moreover, on tensorized training workloads, FETTA outperforms prior accelerators with a speedup of 3.87–14.63x and an energy efficiency improvement of 1.41–2.73x on average.
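To see why a contraction sequence search matters, consider a simple chain of tensor contractions: the order in which pairwise contractions are performed can change the FLOP count by orders of magnitude. The sketch below uses NumPy's `einsum_path` as a stand-in illustration of automated contraction path discovery; it is not the paper's CSSE (which is graph-optimization-driven and targets TNN training graphs), just a minimal analogue of the same idea.

```python
import numpy as np

# Toy chain contraction D_il = A_ij B_jk C_kl, with shapes chosen so
# that the contraction order strongly affects the FLOP count:
# contracting (A·B) first costs ~4*100*100 FLOPs per step, while
# contracting (B·C) first costs ~100*100*100 for the first step alone.
rng = np.random.default_rng(0)
A = rng.random((4, 100))
B = rng.random((100, 100))
C = rng.random((100, 100))

# einsum_path searches over pairwise contraction orders and returns the
# chosen path plus a cost report comparing naive vs. optimized FLOPs.
# This mimics, in miniature, what a contraction sequence search does.
path, info = np.einsum_path('ij,jk,kl->il', A, B, C, optimize='optimal')
print(path)  # list of pairwise contraction steps
print(info)  # human-readable FLOP-count comparison

# The result is identical for any valid path; only the cost differs.
D = np.einsum('ij,jk,kl->il', A, B, C, optimize=path)
```

Here the cheap path contracts the small tensor `A` into the chain first, which is exactly the kind of saving an automated search engine locks in before training begins.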