TrailBlazer: History-Guided Reinforcement Learning for Black-Box LLM Jailbreaking

📅 2026-02-06
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
Existing jailbreaking methods for large language models (LLMs) struggle to effectively leverage vulnerability information exposed during historical interactions, resulting in low and unstable attack efficacy. This work proposes a reinforcement learning–based black-box jailbreaking framework that, for the first time, integrates a history-aware mechanism with an attention reweighting strategy to dynamically model and emphasize critical historical vulnerability signals for guiding subsequent attack decisions. The proposed approach significantly improves both jailbreaking success rates and query efficiency, achieving state-of-the-art performance on AdvBench and HarmBench while substantially reducing the number of required queries.

๐Ÿ“ Abstract
Large Language Models (LLMs) have become integral to many domains, making their safety a critical priority. Prior jailbreaking research has explored diverse approaches, including prompt optimization, automated red teaming, obfuscation, and reinforcement learning (RL) based methods. However, most existing techniques fail to effectively leverage vulnerabilities revealed in earlier interaction turns, resulting in inefficient and unstable attacks. Since jailbreaking involves sequential interactions in which each response influences future actions, reinforcement learning provides a natural framework for this problem. Motivated by this, we propose a history-aware RL-based jailbreak framework that analyzes and reweights vulnerability signals from prior steps to guide future decisions. We show that incorporating historical information alone improves jailbreak success rates. Building on this insight, we introduce an attention-based reweighting mechanism that highlights critical vulnerabilities within the interaction history, enabling more efficient exploration with fewer queries. Extensive experiments on AdvBench and HarmBench demonstrate that our method achieves state-of-the-art jailbreak performance while significantly improving query efficiency. These results underscore the importance of historical vulnerability signals in reinforcement learning-driven jailbreak strategies and offer a principled pathway for advancing adversarial research on LLM safeguards.
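The paper does not include public code, but the core idea in the abstract, scoring each past interaction turn for vulnerability signals and using attention-style weights to emphasize the most revealing turns when choosing the next attack action, can be sketched in a few lines. All names here (`reweight_history`, `history_context`, the scalar per-turn scores) are illustrative assumptions, not the authors' implementation:

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def reweight_history(vuln_scores, temperature=1.0):
    """Attention-style reweighting: convert per-turn vulnerability scores
    (higher = the target model leaked more) into normalized weights that
    emphasize the most revealing historical turns."""
    return softmax([s / temperature for s in vuln_scores])

def history_context(turn_features, weights):
    """Weighted sum of per-turn feature vectors; the result would condition
    the RL policy's next prompt-selection step."""
    dim = len(turn_features[0])
    return [sum(w * f[i] for w, f in zip(weights, turn_features))
            for i in range(dim)]

# Hypothetical example: three past turns; turn 2 exposed the strongest signal.
scores = [0.1, 2.0, 0.5]
feats = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
weights = reweight_history(scores)
context = history_context(feats, weights)
```

A lower `temperature` sharpens the distribution toward the single most vulnerable turn, while a higher one spreads attention across the whole history; the paper's claim is that this focusing is what reduces the number of queries needed.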
Problem

Research questions and friction points this paper is trying to address.

LLM jailbreaking
black-box attack
reinforcement learning
vulnerability exploitation
query efficiency
Innovation

Methods, ideas, or system contributions that make the work stand out.

history-guided reinforcement learning
LLM jailbreaking
attention-based reweighting
vulnerability signals
query efficiency