AI Summary
Existing tensor decomposition-based rank selection for embedded devices relies heavily on manual trial-and-error or incurs prohibitive computational overhead from automatic optimization. To address this, we propose a software-hardware co-designed real-time object detection framework. Our approach uniquely integrates Tensor Train (TT) decomposition with FPGA acceleration in a deeply coupled manner, enabling joint optimization of model compression ratio and hardware execution efficiency. Specifically, we apply TT decomposition to compress YOLOv5, design a custom FPGA accelerator, and perform software-hardware co-design optimizations. Evaluated on Jetson Nano and Xilinx Zynq FPGA platforms, the framework achieves a 68% model size reduction, a 3.2× inference speedup, and end-to-end latency under 32 ms, while preserving high detection accuracy. This work establishes a scalable co-design paradigm for efficient, lightweight vision models at the edge.
Abstract
The rapid development of object detection techniques has drawn attention to the design of efficient Deep Neural Networks (DNNs). However, current state-of-the-art DNN models cannot provide a balanced solution among accuracy, speed, and model size. This paper proposes an efficient real-time object detection framework for resource-constrained hardware devices through hardware and software co-design. Tensor Train (TT) decomposition is applied to compress the YOLOv5 model. By exploiting the unique structural characteristics of the TT decomposition, we develop an efficient hardware accelerator based on FPGA devices. Experimental results show that the proposed method significantly reduces the model size and improves execution time.
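To make the compression step concrete, the following is a minimal sketch of TT decomposition via the standard TT-SVD algorithm (sequential truncated SVDs), applied to a reshaped weight tensor. This is an illustrative NumPy implementation with a hypothetical `max_rank` cap, not the paper's actual rank-selection procedure or FPGA-side layout.

```python
import numpy as np

def tt_svd(tensor, max_rank):
    """Decompose a d-way tensor into Tensor Train cores via sequential truncated SVDs.

    Each core has shape (r_{k-1}, n_k, r_k), with boundary ranks r_0 = r_d = 1.
    `max_rank` is an illustrative uniform cap on the TT-ranks.
    """
    shape = tensor.shape
    cores = []
    rank_prev = 1
    # Unfold the tensor so the first mode (times the incoming rank) indexes rows.
    mat = tensor.reshape(rank_prev * shape[0], -1)
    for k in range(len(shape) - 1):
        u, s, vt = np.linalg.svd(mat, full_matrices=False)
        r = min(max_rank, len(s))                      # truncate to the rank cap
        cores.append(u[:, :r].reshape(rank_prev, shape[k], r))
        # Carry the remainder forward and fold in the next mode.
        mat = (s[:r, None] * vt[:r]).reshape(r * shape[k + 1], -1)
        rank_prev = r
    cores.append(mat.reshape(rank_prev, shape[-1], 1))
    return cores

def tt_reconstruct(cores):
    """Contract the TT cores back into a full tensor (for error checking)."""
    out = cores[0]
    for core in cores[1:]:
        out = np.tensordot(out, core, axes=(-1, 0))
    # Drop the boundary ranks r_0 = r_d = 1.
    return out.reshape([c.shape[1] for c in cores])
```

With an aggressive rank cap, the cores hold far fewer parameters than the original tensor, which is the source of the model size reduction; the chained small contractions are also what the FPGA accelerator can exploit.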