🤖 AI Summary
The SLSQP algorithm suffers from performance bottlenecks due to QR decomposition’s high sensitivity to memory access patterns and intermediate result storage.
Method: This paper proposes a task-graph scheduling method that accounts for the state required by SLSQP's iterative back-substitution. It introduces a dual-queue scheduling scheme that, within DAG-based scheduling, explicitly keeps the intermediate kernel results of QR decomposition accessible and reusable across iterations. A task-graph framework is developed in high-level C++, integrating compiler optimizations, memory-aware scheduling, and fine-grained dependency modeling.
Contribution/Results: Experimental evaluation demonstrates substantial performance gains, including a 10× improvement over the sequential-QR version of the SLSQP algorithm for nonlinear programming problems.
📝 Abstract
Efficient task scheduling is paramount in parallel programming on multi-core architectures, where tasks are fundamental computational units. QR factorization is a critical sub-routine in Sequential Least Squares Quadratic Programming (SLSQP) for solving non-linear programming (NLP) problems. QR factorization decomposes a matrix into an orthogonal matrix Q and an upper triangular matrix R, which are essential for solving systems of linear equations arising from optimization problems. SLSQP uses an in-place version of QR factorization, which requires storing intermediate results for the next steps of the algorithm. Although DAG-based approaches for QR factorization are prevalent in the literature, they often lack control over the intermediate kernel results, providing only the final output matrices Q and R. This limitation is particularly challenging in SLSQP, where intermediate results of QR factorization are crucial for the back-substitution logic at each iteration. Our work introduces novel scheduling techniques using a two-queue approach to execute the QR factorization kernel effectively. This approach, implemented in the high-level C++ programming language, facilitates compiler optimizations and allows storing the intermediate results required by the back-substitution logic. Empirical evaluations demonstrate substantial performance gains, including a 10x improvement over the sequential QR version of the SLSQP algorithm.
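To make concrete why an in-place QR factorization must keep its intermediate results for back-substitution, the following is a minimal sequential sketch in C++. It is not the paper's kernel or its scheduler: it assumes a textbook Householder factorization, and the names `Matrix`, `householder_qr_inplace`, and `solve_least_squares` are hypothetical helpers introduced only for illustration. The reflector vectors are stored below the diagonal of the input matrix and reused later to apply Q^T to a right-hand side, after which R is used for back-substitution.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Dense row-major matrix; a hypothetical helper type for this sketch.
struct Matrix {
    std::size_t rows, cols;
    std::vector<double> a;
    double& at(std::size_t i, std::size_t j) { return a[i * cols + j]; }
    double at(std::size_t i, std::size_t j) const { return a[i * cols + j]; }
};

// In-place Householder QR for rows >= cols.
// On return: R sits on and above the diagonal, each reflector v_k
// (with implicit v_k[k] = 1) sits below the diagonal of column k,
// and tau[k] is its scalar factor, so H_k = I - tau[k] * v_k * v_k^T.
std::vector<double> householder_qr_inplace(Matrix& A) {
    assert(A.rows >= A.cols);
    std::vector<double> tau(A.cols, 0.0);
    for (std::size_t k = 0; k < A.cols; ++k) {
        double norm = 0.0;
        for (std::size_t i = k; i < A.rows; ++i) norm += A.at(i, k) * A.at(i, k);
        norm = std::sqrt(norm);
        if (norm == 0.0) continue;  // zero column: tau[k] = 0 means H_k = I

        const double alpha = A.at(k, k);
        const double beta = (alpha >= 0.0) ? -norm : norm;  // avoids cancellation
        const double v0 = alpha - beta;                      // unscaled v_k[k]
        tau[k] = (beta - alpha) / beta;
        for (std::size_t i = k + 1; i < A.rows; ++i) A.at(i, k) /= v0;
        A.at(k, k) = beta;  // diagonal entry of R

        // Apply H_k to the trailing columns.
        for (std::size_t j = k + 1; j < A.cols; ++j) {
            double s = A.at(k, j);
            for (std::size_t i = k + 1; i < A.rows; ++i) s += A.at(i, k) * A.at(i, j);
            s *= tau[k];
            A.at(k, j) -= s;
            for (std::size_t i = k + 1; i < A.rows; ++i) A.at(i, j) -= s * A.at(i, k);
        }
    }
    return tau;
}

// Least-squares solve that reuses the stored intermediates:
// first b <- Q^T b via the reflectors, then back-substitution with R.
// Assumes R has a nonzero diagonal (full column rank).
std::vector<double> solve_least_squares(const Matrix& QR,
                                        const std::vector<double>& tau,
                                        std::vector<double> b) {
    for (std::size_t k = 0; k < QR.cols; ++k) {
        double s = b[k];
        for (std::size_t i = k + 1; i < QR.rows; ++i) s += QR.at(i, k) * b[i];
        s *= tau[k];
        b[k] -= s;
        for (std::size_t i = k + 1; i < QR.rows; ++i) b[i] -= s * QR.at(i, k);
    }
    std::vector<double> x(QR.cols, 0.0);
    for (std::size_t k = QR.cols; k-- > 0;) {
        double s = b[k];
        for (std::size_t j = k + 1; j < QR.cols; ++j) s -= QR.at(k, j) * x[j];
        x[k] = s / QR.at(k, k);
    }
    return x;
}
```

In this sketch the reflectors and `tau` play the role of the intermediate results: a scheduler that returned only the final Q and R would either have to form Q explicitly or lose the data that `solve_least_squares` consumes. How the paper's two-queue scheduler preserves these intermediates across parallel tasks is not specified in the abstract and is not modeled here.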