AI Summary
This work addresses the scarcity and high cost of high-quality, diverse multi-turn dialogue data for post-training large language models, particularly in low-resource domains. To tackle this challenge, the authors propose an adversarial arena-based data generation framework that reframes data construction as a multi-agent competitive interaction: an attacker designs challenging prompts while a defender generates aligned responses. Integrated with a crowdsourcing incentive mechanism, this approach autonomously produces dialogues exhibiting high difficulty and diversity. Focusing on safety alignment in the cybersecurity domain, the project generated 19,683 multi-turn dialogues. Models fine-tuned on this dataset demonstrate significant improvements in secure code generation, achieving performance gains of 18.47% and 29.42% on the CyberSecEval-Instruct and CyberSecEval-MITRE benchmarks, respectively.
Abstract
Post-training Large Language Models requires diverse, high-quality data, which is rare and costly to obtain, especially in low-resource domains and for multi-turn conversations. Common solutions are crowdsourcing or synthetic generation, but both often yield low-quality or low-diversity data. We introduce Adversarial Arena for building high-quality conversational datasets by framing data generation as an adversarial task: attackers create prompts, and defenders generate responses. This interactive competition between multiple teams naturally produces diverse and complex data. We validated this approach by conducting a competition with 10 academic teams from top US and European universities, each building attacker or defender bots. The competition, focused on safety alignment of LLMs in cybersecurity, generated 19,683 multi-turn conversations. Fine-tuning an open-source model on this dataset produced an 18.47% improvement in secure code generation on CyberSecEval-Instruct and a 29.42% improvement on CyberSecEval-MITRE.
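The attacker/defender loop described above can be illustrated with a minimal Python sketch. It is not the authors' implementation: the all-pairs matching, the fixed number of turns, and every name in it (`Conversation`, `run_arena`, `toy_attacker`, `toy_defender`) are assumptions made for illustration. The abstract does not say how teams were paired or scored, so scoring and incentives are left out.

```python
from dataclasses import dataclass, field
from itertools import product
from typing import Callable, Dict, List

# Hypothetical interface: a bot maps the conversation so far to its next message.
Bot = Callable[[List[str]], str]


@dataclass
class Conversation:
    attacker: str
    defender: str
    turns: List[str] = field(default_factory=list)


def run_arena(attackers: Dict[str, Bot], defenders: Dict[str, Bot],
              num_turns: int = 3) -> List[Conversation]:
    """Pair every attacker with every defender (an assumed matching scheme)
    and record one multi-turn conversation per pair."""
    dataset = []
    for (a_name, attack), (d_name, defend) in product(attackers.items(),
                                                      defenders.items()):
        convo = Conversation(attacker=a_name, defender=d_name)
        for _ in range(num_turns):
            # The attacker sees the history and writes the next prompt.
            convo.turns.append(attack(convo.turns))
            # The defender sees the history, including that prompt, and replies.
            convo.turns.append(defend(convo.turns))
        dataset.append(convo)
    return dataset


# Toy stand-ins for real attacker and defender bots (which would call LLMs).
def toy_attacker(history: List[str]) -> str:
    return f"prompt #{len(history) // 2 + 1}"


def toy_defender(history: List[str]) -> str:
    return f"reply to: {history[-1]}"


conversations = run_arena({"team_a": toy_attacker},
                          {"team_b": toy_defender},
                          num_turns=2)
```

With several attacker and defender teams registered, the same loop yields one conversation per attacker/defender pair, which is one plausible way a competition between multiple teams could produce varied dialogues.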