AI Summary
To address two gaps in malicious Python package detection on PyPI, namely limited robustness to adversarial code transformations and differing false-positive rate (FPR) requirements across users, this paper proposes a detection framework that targets both adversarial robustness and scenario adaptability. Methodologically, the authors design fine-grained code obfuscation strategies to generate diverse adversarial packages and apply adversarial training to harden the model against semantics-preserving transformations. The detector can also be tuned to a target FPR, enabling different trade-offs between precision and recall across deployment scenarios, e.g., low-FPR operation (0.1%) for public repositories versus higher-recall operation (10% FPR) for enterprise security teams. Evaluated on 122,398 real-world PyPI packages, adversarial training improves robustness by 2.5x, and the work uncovers 346 malicious packages overall. In the two case studies, the detector raises on average 2.18 false positives per day for PyPI maintainers and 1.24 per day for the enterprise deployment, so reviewing them takes only a few minutes.
Abstract
The rise of supply chain attacks via malicious Python packages demands robust detection solutions. Current approaches, however, overlook two critical challenges: robustness against adversarial source code transformations and adaptability to the varying false positive rate (FPR) requirements of different actors, from repository maintainers (requiring low FPR) to enterprise security teams (higher FPR tolerance).
We introduce a robust detector capable of seamless integration into both public repositories like PyPI and enterprise ecosystems. To ensure robustness, we propose a novel methodology for generating adversarial packages using fine-grained code obfuscation. Combining these with adversarial training (AT) enhances detector robustness by 2.5x. We comprehensively evaluate AT effectiveness by testing our detector against 122,398 packages collected daily from PyPI over 80 days, showing that AT needs careful application: it makes the detector more robust to obfuscations and allows finding 10% more obfuscated packages, but slightly decreases performance on non-obfuscated packages.
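To make "semantics-preserving transformation" concrete, the following is a minimal sketch of one such obfuscation: hiding string literals behind a hex-decoding expression. The abstract does not list the specific obfuscations used, so this particular transformation, the class name `StringHexEncoder`, and the helper `obfuscate_strings` are illustrative assumptions rather than the authors' method. In an adversarial-training setup, variants like this would be added to the training set with the same label as the original package.

```python
import ast


class StringHexEncoder(ast.NodeTransformer):
    """Hypothetical obfuscation: rewrite each string literal as
    bytes.fromhex("<hex>").decode("utf-8"), which evaluates to the same string."""

    def visit_JoinedStr(self, node):
        # Leave f-strings untouched: their literal parts must stay constants.
        return node

    def visit_Constant(self, node):
        if not isinstance(node.value, str):
            return node  # numbers, bytes, None, etc. are not rewritten
        hex_text = node.value.encode("utf-8").hex()
        call = ast.parse(
            f"bytes.fromhex({hex_text!r}).decode('utf-8')", mode="eval"
        ).body
        return ast.copy_location(call, node)


def obfuscate_strings(source: str) -> str:
    """Return source code with plain string literals hex-encoded (Python 3.9+)."""
    tree = StringHexEncoder().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)


if __name__ == "__main__":
    original = 'def greet(name):\n    return "hello " + name\n'
    print(obfuscate_strings(original))
```

This sketch ignores corner cases such as string constants inside `match` patterns, and it turns docstrings into ordinary expressions; a production-grade obfuscator would need to handle those.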
We demonstrate production adaptability of our detector via two case studies: (i) one for PyPI maintainers (tuned at 0.1% FPR) and (ii) one for enterprise teams (tuned at 10% FPR). In the former, we analyze 91,949 packages collected from PyPI over 37 days, achieving a daily detection rate of 2.48 malicious packages with only 2.18 false positives. In the latter, we analyze 1,596 packages adopted by a multinational software company, obtaining only 1.24 false positives daily. These results show that our detector can be seamlessly integrated into both public repositories like PyPI and enterprise ecosystems, with only a few minutes needed to review the false positives.
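Tuning a detector to a target FPR typically amounts to choosing a decision threshold from scores on known-benign packages. The sketch below assumes the detector outputs a maliciousness score and flags a package when the score is strictly above the threshold; the function names and the synthetic validation scores are hypothetical and not taken from the paper.

```python
import math


def threshold_for_fpr(benign_scores, target_fpr):
    """Pick a threshold so that at most target_fpr of benign scores exceed it.

    A package is flagged when score > threshold (assumed decision rule).
    """
    if not benign_scores:
        raise ValueError("need at least one benign validation score")
    ranked = sorted(benign_scores, reverse=True)
    allowed = math.floor(target_fpr * len(ranked))  # tolerated false positives
    if allowed >= len(ranked):
        return -math.inf  # every package may be flagged
    # Scores strictly above ranked[allowed] number at most `allowed`.
    return ranked[allowed]


def false_positive_rate(benign_scores, threshold):
    """Fraction of benign packages that would be flagged at this threshold."""
    return sum(s > threshold for s in benign_scores) / len(benign_scores)


if __name__ == "__main__":
    # Synthetic benign validation scores, evenly spread over [0, 1).
    benign = [i / 1000 for i in range(1000)]
    for name, fpr in [("PyPI maintainers", 0.001), ("enterprise teams", 0.10)]:
        t = threshold_for_fpr(benign, fpr)
        print(name, t, false_positive_rate(benign, t))
```

The same trained model serves both scenarios; only the threshold changes, with the stricter 0.1% setting yielding a higher threshold than the 10% setting.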
Overall, we uncovered 346 malicious packages, now reported to the community.