🤖 AI Summary
To address the challenge of efficiently vectorizing general tensor permutations across diverse tensor shapes, permutation patterns, instruction sets, and data widths, this paper proposes the first fully automatic, low-complexity SIMD code generation framework. Departing from conventional stepwise permutation approaches, the method uses a domain-specific language (DSL) and compiler to enable end-to-end optimized code synthesis. It introduces two key innovations: tensor shape–aware permutation decomposition and fine-grained SIMD instruction scheduling with explicit width-aware mapping. Experimental evaluation on modern wide-vector SIMD hardware demonstrates up to 38× speedup over NumPy for specific cases and an average 5× acceleration across general scenarios, significantly improving tensor reordering efficiency. The core contribution is the first framework supporting fully automatic, cross-architecture, bit-width–aware vectorization for arbitrary permutation patterns—enabling portable, high-performance tensor layout transformations without manual intervention.
📝 Abstract
Tensor permutation is a fundamental operation widely used in AI, tensor networks, and related fields. It is nonetheless difficult to optimize: different tensor shapes and permutation maps can lead to vastly different performance. SIMD permutation has been studied since 2006, but the best method at that time split a complex permutation into a sequence of simple permutations amenable to SIMD, which can increase the overall cost for very complex permutations. Subsequently, as tensor contraction gained significant attention, researchers explored the structured permutations that arise in tensor contraction. Progress on general permutations has been limited, and with increasing SIMD bit widths, achieving efficient performance for such permutations has become increasingly challenging. We propose a SIMD permutation toolkit, system, that generates optimized permutation code for arbitrary instruction sets, bit widths, tensor shapes, and permutation patterns, while maintaining low complexity. In our experiments, system achieves up to $38\times$ speedup for special cases and $5\times$ for general cases compared to NumPy.
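For readers unfamiliar with the operation being optimized, the sketch below illustrates what a tensor permutation is, using NumPy (the baseline the paper compares against). The shape and permutation map here are arbitrary illustrative choices, not values from the paper; the paper's contribution is generating SIMD code that performs this data movement efficiently for any such map.

```python
import numpy as np

# A small 3-D tensor; dtype chosen arbitrarily for illustration.
x = np.arange(2 * 3 * 4, dtype=np.float32).reshape(2, 3, 4)

# Permutation map (0, 1, 2) -> (2, 0, 1): axis 2 becomes the leading axis,
# so y[i, j, k] == x[j, k, i]. np.transpose alone only returns a view;
# ascontiguousarray forces the physical data reordering that a
# permutation kernel must actually perform.
y = np.ascontiguousarray(np.transpose(x, (2, 0, 1)))

print(y.shape)                    # (4, 2, 3)
print(y[1, 0, 2] == x[0, 2, 1])   # True: checks the index mapping
```

Because the memory access pattern depends on both the tensor shape and the permutation map, the same operation can be bandwidth-bound or shuffle-bound on SIMD hardware, which is why shape-aware code generation matters.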