🤖 AI Summary
Although production LLM agents with tool-using capabilities undergo safety training, they may still harbor vulnerabilities, necessitating reliable red-teaming methods to assess their robustness. This work adapts the Go-Explore algorithm for red-teaming and systematically evaluates its efficacy in uncovering attacks against GPT-4o-mini through 28 multi-seed experiments. The study reveals that random-seed variance overwhelmingly influences evaluation outcomes, and that averaging across multiple seeds substantially reduces this bias. Furthermore, simple state signatures consistently outperform more complex designs, while reward shaping frequently causes exploration collapse (failing in 94% of runs) or generates numerous false positives (18 unverified cases). Ensemble approaches enhance the diversity of discovered attack types, and the variance results underscore the unreliability of single-seed evaluations.
📝 Abstract
Production LLM agents with tool-using capabilities require security testing despite their safety training. We adapt Go-Explore to evaluate GPT-4o-mini across 28 experimental runs spanning six research questions. We find that random-seed variance dominates the effect of algorithmic parameters, yielding an 8x spread in outcomes; single-seed comparisons are therefore unreliable, while multi-seed averaging materially reduces variance in our setup. Reward shaping consistently harms performance, causing exploration collapse in 94% of runs or producing 18 false positives with zero verified attacks. In our environment, simple state signatures outperform complex ones. For comprehensive security testing, ensembles provide attack-type diversity, whereas single agents optimize coverage within a given attack type. Overall, these results suggest that seed variance and targeted domain knowledge can outweigh algorithmic sophistication when testing safety-trained models.
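The core mechanics the abstract describes (a Go-Explore archive keyed by simple state signatures, plus multi-seed averaging to tame run-to-run variance) can be sketched as a toy loop. Everything below is illustrative, not the paper's implementation: the integer state space, the `signature` bucketing, and the step counts are all assumptions standing in for the real agent-interaction environment.

```python
import random

def signature(state):
    # Simple state signature: a coarse bucket of the toy integer state.
    # Hypothetical stand-in for abstracting an agent's conversation state.
    return state // 10

def go_explore_run(seed, steps=200):
    """Toy Go-Explore loop: keep one representative state per signature
    cell, repeatedly return to a random archived cell and explore from it."""
    rng = random.Random(seed)
    archive = {signature(0): 0}           # cell -> best state seen in cell
    for _ in range(steps):
        cell = rng.choice(list(archive))  # select a cell to return to
        state = archive[cell]
        state = max(0, state + rng.choice([-1, 1, 2]))  # explore one step
        sig = signature(state)
        if sig not in archive or state > archive[sig]:
            archive[sig] = state          # archive new or improved cell
    return len(archive)                   # coverage = number of cells found

# Multi-seed averaging: individual seeds can differ widely,
# but the mean across seeds is a more stable evaluation signal.
coverages = [go_explore_run(seed) for seed in range(8)]
mean_coverage = sum(coverages) / len(coverages)
```

The single-seed `coverages` values illustrate the paper's point in miniature: any one run is a noisy sample, while `mean_coverage` is what a multi-seed evaluation would report.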