🤖 AI Summary
This work addresses the credit assignment problem in RL-based multi-turn jailbreaking: existing training-based methods rely on trajectory-level rewards that are broadcast uniformly to every turn, so they cannot assess how much each dialogue turn contributes to attack success. The authors propose TRACE, a turn-aware credit assignment framework. For successful trajectories, it estimates each turn’s contribution via leave-one-turn-out semantic masking; for failed ones, it assigns penalties based on prompt harmfulness and semantic relevance, plus an additional local refusal-aware penalty. The attack-side credit signal is also reused for multi-turn defense alignment. The study finds that turn-level contributions in multi-turn jailbreaking are non-uniform, phase-dependent, and target-specific. Experiments on open-source and closed-source targets show that TRACE yields about a 25% relative improvement in attack success rate over the strongest RL baseline, and that reusing the credit signal for defense alignment improves the safety-utility balance.
📝 Abstract
Deploying LLMs in multi-turn dialogues facilitates jailbreak attacks that distribute harmful intent across seemingly benign turns. Recent training-based multi-turn jailbreak methods learn long-horizon attack strategies from interaction feedback, but often rely on coarse trajectory-level outcome signals that broadcast uniformly to every turn. However, we find that turn-level contributions in multi-turn jailbreaking are non-uniform, phase-dependent, and target-specific. Such coarse outcome supervision induces a credit assignment problem, leading to over-rewarding redundant turns in successful trajectories and under-crediting useful intermediate turns in failed ones. To address this, we propose TRACE, a turn-aware credit assignment framework for reinforcement learning (RL)-based multi-turn jailbreaking. For successful trajectories, TRACE estimates turn-level contributions via leave-one-turn-out semantic masking; for failed ones, TRACE assigns penalties based on prompt harmfulness and semantic relevance, with an additional local refusal-aware penalty. Furthermore, we reuse the attack-side credit signal for multi-turn defense alignment. Extensive experiments on open-source and closed-source targets show that TRACE achieves strong overall performance in effectiveness, transferability, and efficiency, yielding about a 25% relative improvement in attack success rate over the strongest RL baseline while also improving the safety-utility balance when reused for defense alignment.
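The leave-one-turn-out idea described above can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify the masking operator, the scoring model, or how credits are normalized, so the `MASK` placeholder, the `score_fn` callable, the clipping at zero, and the sum-to-one normalization are all assumptions made for illustration. The toy scorer simply counts turns with a marker prefix and stands in for whatever outcome judge a real system would use.

```python
from typing import Callable, List, Sequence

# Hypothetical neutral placeholder used to mask a turn (an assumption).
MASK = "[MASKED]"


def leave_one_turn_out_credit(
    turns: Sequence[str],
    score_fn: Callable[[List[str]], float],
) -> List[float]:
    """Estimate per-turn credit as the score drop when that turn is masked.

    `score_fn` is a hypothetical trajectory-level scorer; higher is better.
    Negative drops are clipped to zero and credits are normalized to sum to 1
    (both choices are assumptions, not taken from the paper).
    """
    if not turns:
        return []

    full_score = score_fn(list(turns))
    drops = []
    for i in range(len(turns)):
        masked = list(turns)
        masked[i] = MASK  # mask exactly one turn, keep the rest intact
        drops.append(max(full_score - score_fn(masked), 0.0))

    total = sum(drops)
    if total == 0.0:
        # No turn changes the score: fall back to uniform credit.
        return [1.0 / len(turns)] * len(turns)
    return [d / total for d in drops]


def toy_score(turns: List[str]) -> float:
    """Toy stand-in scorer: fraction of turns that start with 'useful'."""
    if not turns:
        return 0.0
    return sum(t.startswith("useful") for t in turns) / len(turns)


if __name__ == "__main__":
    trajectory = ["filler", "useful A", "useful B", "filler"]
    print(leave_one_turn_out_credit(trajectory, toy_score))
```

In this toy setting the two filler turns receive zero credit and the two marked turns split the credit evenly, which is the non-uniform pattern a uniform trajectory-level reward cannot express. The failure-side penalties (prompt harmfulness, semantic relevance, local refusal) are omitted because the abstract does not give their form.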