🤖 AI Summary
This work addresses the absence of a fair, verifiable, and evidence-based evaluation framework for human–AI collaboration in current cybersecurity competitions. It proposes the first formal taxonomy of AI collaboration autonomy levels in Capture-the-Flag (CTF) settings, namely human-in-the-loop, autonomous agent, and hybrid modes, and implements a large language model–based competition system integrating tool-augmented prompting, agent trajectory tracking, dialogue log analysis, and reflection-based retry mechanisms. Leveraging multi-regional competition data, the study introduces a traceable submission protocol and a phased track structure. Empirical results show that autonomous and hybrid modes achieve higher completion rates on challenges requiring iterative testing and tool interaction, while classroom participants prefer lightweight prompt augmentation over complex multi-agent architectures.
📝 Abstract
Large language models are rapidly changing how learners acquire and demonstrate cybersecurity skills. However, when human–AI collaboration is allowed, educators still lack validated competition designs and evaluation practices that remain fair and evidence-based. This paper presents a cross-regional study of LLM-centered Capture-the-Flag competitions built on the Cyber Security Awareness Week competition system. To understand how autonomy levels and participants' knowledge backgrounds influence problem-solving performance and learning-related behaviors, we formalize three autonomy levels: human-in-the-loop, autonomous agent frameworks, and hybrid modes. To enable verification, we require traceable submissions, including conversation logs, agent trajectories, and agent code. We analyze multi-region competition data covering an in-class track, a standard track, and a year-long expert track, each targeting participants with different knowledge backgrounds. Using data from the 2025 competition, we compare solve performance across autonomy levels and challenge categories, and observe that autonomous agent frameworks and hybrid modes achieve higher completion rates on challenges requiring iterative testing and tool interaction. In the in-class track, we classify participants' agent designs and find a preference for lightweight, tool-augmented prompting and reflection-based retries over complex multi-agent architectures. Our results offer actionable guidance for designing LLM-assisted cybersecurity competitions as learning technologies, including autonomy-specific scoring criteria, evidence requirements that support solution verification, and track structures that improve accessibility while preserving reliable evaluation and engagement.