FragBench: Cross-Session Attacks Hidden in Benign-Looking Fragments

📅 2026-05-10
🤖 AI Summary
Current large language model safety evaluations do not detect fragmented attacks: malicious requests split into pieces that each appear harmless in isolation, are spread across sessions with no shared context, and become harmful only when combined. To address this, the authors introduce FragBench, a benchmark derived from 24 real-world cyber-incident campaigns, which decomposes attack chains into seemingly benign fragments. The benchmark is organized as two paired tasks: an adversarial rewriter hardens malicious fragments so that they evade a single-turn safety judge, and graph-based detectors model user-level interaction graphs to identify cross-session malicious behavior. In the reported experiments, four GNN variants and three classical machine learning baselines all capture the cross-session pattern, achieving aggregate event-level F1 scores of 0.88–0.96, while the single-turn judge is near chance by construction. The study defines and evaluates the fragmented attack setting and argues that robust defense requires modeling the user behavior graph rather than isolated prompts.
📝 Abstract
An attacker can split a malicious goal into sub-prompts that each look benign on their own and only become harmful in combination. Existing LLM safety benchmarks evaluate prompts one at a time, or across turns of a single chat, and so do not look for a malicious signal spread across separate sessions with no shared context. We build FragBench, a benchmark drawn from 24 real-world cyber-incident campaigns, which keeps the full attack trail: the multi-fragment kill chain, the per-fragment safety-judge verdicts, sandboxed execution traces, and a matched set of benign cover sessions. FragBench splits this trail into two paired tasks: an adversarial rewriter that hardens fragments against a single-turn safety judge (FragBench Attack), and a graph-based user-level detector trained on the resulting interactions (FragBench Defense). The single-turn judge is near chance on the released corpus by construction, but four GNN variants and three classical-ML baselines all recover the cross-session feature, reaching aggregate event-level F1 = 0.88-0.96. Defending against fragmented LLM misuse therefore requires modeling the cross-session interaction graph, rather than isolated prompts. Our generator, rewriter, sandbox harness, and detector are released at https://github.com/LidaSafety/fragbench.
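The contrast the abstract draws between per-prompt judging and user-level modeling can be illustrated with a toy example. The sketch below is not the FragBench detector: the `Fragment` fields, the stage labels, the shared-artifact linking rule, and the `min_stages` threshold are all hypothetical choices made for illustration, and a simple connected-components heuristic stands in for the GNN and classical-ML models described in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Fragment:
    # All fields are hypothetical; the released corpus may use a different schema.
    user_id: str
    session_id: str
    stage: str          # assumed kill-chain label, e.g. "recon"
    artifact: str       # assumed shared entity, e.g. a hostname
    judge_score: float  # assumed single-turn judge score in [0, 1]


def single_turn_flags(fragments, threshold=0.5):
    """Flag a user only if one fragment, judged alone, crosses the threshold."""
    return {f.user_id for f in fragments if f.judge_score >= threshold}


def user_level_flags(fragments, min_stages=3):
    """Flag a user whose linked sessions jointly cover several attack stages."""
    by_user = defaultdict(list)
    for f in fragments:
        by_user[f.user_id].append(f)

    flagged = set()
    for user, frags in by_user.items():
        # Per-user session graph: two sessions are linked if they share an artifact.
        sessions_by_artifact = defaultdict(set)
        stages_by_session = defaultdict(set)
        for f in frags:
            sessions_by_artifact[f.artifact].add(f.session_id)
            stages_by_session[f.session_id].add(f.stage)
        adjacency = defaultdict(set)
        for sessions in sessions_by_artifact.values():
            for a, b in combinations(sorted(sessions), 2):
                adjacency[a].add(b)
                adjacency[b].add(a)

        # Depth-first search over connected components, pooling stages per component.
        seen = set()
        for start in stages_by_session:
            if start in seen:
                continue
            seen.add(start)
            stack, component_stages = [start], set()
            while stack:
                session = stack.pop()
                component_stages |= stages_by_session[session]
                for neighbor in adjacency[session]:
                    if neighbor not in seen:
                        seen.add(neighbor)
                        stack.append(neighbor)
            if len(component_stages) >= min_stages:
                flagged.add(user)
                break
    return flagged


# Toy data: u1 spreads three stages over three sessions that touch the same host;
# u2 asks two unrelated low-risk questions about different hosts.
FRAGMENTS = [
    Fragment("u1", "s1", "recon", "host-a", 0.10),
    Fragment("u1", "s2", "credential_access", "host-a", 0.20),
    Fragment("u1", "s3", "exfiltration", "host-a", 0.15),
    Fragment("u2", "s4", "recon", "host-b", 0.10),
    Fragment("u2", "s5", "recon", "host-c", 0.10),
]

print(single_turn_flags(FRAGMENTS))
print(user_level_flags(FRAGMENTS))
```

On this toy data no single fragment crosses the judge threshold, so the per-fragment check flags nobody, while the user-level pass flags only `u1`, whose linked sessions together span three stages. A learned detector would replace the hand-written rule with features or message passing over the same kind of graph.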
Problem

Research questions and friction points this paper is trying to address.

cross-session attacks
fragmented prompts
LLM safety
adversarial fragmentation
multi-session malicious behavior
Innovation

Methods, ideas, or system contributions that make the work stand out.

cross-session attack
fragmented prompting
graph-based detection
LLM safety benchmark
adversarial rewriting