🤖 AI Summary
This study addresses the limitations of locally deployed open-weight large language models (LLMs) in automating Linux privilege escalation, where their performance currently falls short of practical penetration testing requirements. The authors present the first systematic evaluation and integration of five enhancement strategies (chain-of-thought prompting, retrieval-augmented generation, structured prompting, history compression, and reflection-based analysis) within the hackingBuddyGPT framework. Experimental results show that Llama3.1 70B exploits 83% of tested vulnerabilities, while Llama3.1 8B and Qwen2.5 7B reach 67% when given guidance, matching or surpassing cloud-based restricted-weight models such as GPT-4o. A full-factorial ablation shows that reflection mechanisms contribute most and identifies vulnerability discovery as a remaining bottleneck for local models.
📝 Abstract
Recent research has demonstrated the potential of Large Language Models (LLMs) for autonomous penetration testing, particularly when using cloud-based restricted-weight models. However, reliance on such models introduces security, privacy, and sovereignty concerns, motivating the use of locally hosted open-weight alternatives. Prior work shows that small open-weight models perform poorly on automated Linux privilege escalation, limiting their practical applicability.
In this paper, we present a systematic empirical study of whether targeted system-level and prompting interventions can bridge this performance gap. We analyze failure modes of open-weight models in autonomous privilege escalation, map them to established enhancement techniques, and evaluate five concrete interventions (chain-of-thought prompting, retrieval-augmented generation, structured prompts, history compression, and reflective analysis) implemented as extensions to hackingBuddyGPT.
Our results show that open-weight models can match or outperform cloud-based baselines such as GPT-4o. With our treatments enabled, Llama3.1 70B exploits 83% of tested vulnerabilities, while smaller models including Llama3.1 8B and Qwen2.5 7B achieve 67% when given guidance. A full-factorial ablation study over all treatment combinations reveals that reflection-based treatments contribute most and identifies vulnerability discovery as a remaining bottleneck for local models.
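To make the idea of a full-factorial ablation concrete, the following is a minimal sketch of how every on/off combination of treatments could be enumerated and how a per-treatment main effect could be estimated. It assumes the ablation toggles the five interventions named above independently, which gives 2^5 = 32 configurations; the abstract does not spell out the exact factors. The identifiers `TREATMENTS`, `full_factorial`, and `main_effect` are hypothetical and are not taken from hackingBuddyGPT.

```python
from itertools import product

# Hypothetical labels for the five interventions named in the abstract.
TREATMENTS = [
    "chain_of_thought",
    "rag",
    "structured_prompts",
    "history_compression",
    "reflection",
]


def full_factorial(treatments):
    """Return every on/off configuration as a dict mapping treatment -> bool."""
    return [
        dict(zip(treatments, flags))
        for flags in product([False, True], repeat=len(treatments))
    ]


def main_effect(scores, treatment):
    """Mean score with the treatment enabled minus mean score with it disabled.

    `scores` is a list of (config, score) pairs, one per configuration,
    where `score` would be a measured success rate (no real data is used here).
    """
    on = [s for cfg, s in scores if cfg[treatment]]
    off = [s for cfg, s in scores if not cfg[treatment]]
    return sum(on) / len(on) - sum(off) / len(off)


configs = full_factorial(TREATMENTS)
print(len(configs))  # 32 configurations for five binary treatments
```

In such a design each treatment is enabled in exactly half of the configurations, so the main effect averages over all settings of the other treatments rather than comparing against a single baseline.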