Hail to the Thief: Exploring Attacks and Defenses in Decentralised GRPO

📅 2025-11-12
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This paper identifies a novel security threat in decentralized Group Relative Policy Optimization (GRPO): malicious nodes can inject arbitrary adversarial tokens—either in-context or out-of-context—to poison the post-training process of large language models (LLMs). Method: The authors formally model and empirically validate this attack vector, demonstrating near-100% success rates within just 50 optimization iterations on mathematical and programming tasks. To counter it, they propose a dual-mode defense: (i) string-level gradient exchange constraints for homogeneous models, and (ii) consensus-based malicious token filtering for heterogeneous models. Contribution/Results: The framework achieves 100% attack mitigation and establishes a verifiable, robust security foundation for decentralized alignment training—enabling secure, collaborative LLM optimization without centralized oversight.

Technology Category

Application Category

📝 Abstract
Group Relative Policy Optimization (GRPO) has demonstrated great utility in the post-training of Large Language Models (LLMs). In GRPO, prompts are answered by the model and, through reinforcement learning, preferred completions are learned. Owing to its small communication volume, GRPO is inherently suitable for decentralised training, as prompts can be answered concurrently by multiple nodes and then exchanged in the form of strings. In this work, we present the first adversarial attack on decentralised GRPO. We demonstrate that malicious parties can poison such systems by injecting arbitrary malicious tokens into benign models through both out-of-context and in-context attacks. Using empirical examples from math and coding tasks, we show that adversarial attacks can easily poison benign nodes, polluting their local LLM post-training and achieving attack success rates of up to 100% in as few as 50 iterations. We propose two defenses against these attacks, depending on whether all users train the same model or different models. We show that these defenses achieve stop rates of up to 100%, rendering the attack infeasible.
Problem

Research questions and friction points this paper is trying to address.

Identifies adversarial attacks in decentralized GRPO training
Demonstrates malicious token injection in LLM post-training
Proposes defenses against poisoning attacks in collaborative learning
Innovation

Methods, ideas, or system contributions that make the work stand out.

Adversarial attacks poison decentralized GRPO training
Defenses achieve 100% stop rates against attacks
Methods vary for same or different user models
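The paper's consensus-based defense for heterogeneous models is only summarized above; its exact algorithm is not given here. As a rough illustrative sketch (function name, tokenization, and threshold are hypothetical, not the authors' implementation), a node might keep only tokens that appear in completions from a majority of peers, discarding completions containing unsupported tokens:

```python
from collections import Counter

def consensus_filter(peer_completions, min_support=0.5):
    """Hypothetical sketch: keep a token as trusted only if it appears in
    completions from at least `min_support` of the peers; drop completions
    containing any unsupported (potentially adversarial) token."""
    n_peers = len(peer_completions)
    support = Counter()
    for completions in peer_completions:  # one list of strings per peer
        tokens_seen = set()
        for c in completions:
            tokens_seen.update(c.split())  # naive whitespace tokenization
        for t in tokens_seen:
            support[t] += 1  # count each peer at most once per token
    trusted = {t for t, s in support.items() if s / n_peers >= min_support}
    # retain only completions whose every token has consensus support
    return [[c for c in completions if all(t in trusted for t in c.split())]
            for completions in peer_completions]
```

A lone malicious peer injecting rare adversarial tokens would see its completions filtered out, since those tokens lack support from the honest majority; the string-level nature of GRPO exchange is what makes this kind of cross-peer comparison possible.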
Nikolay Blagoev
Gensyn, University of Neuchatel
Oğuzhan Ersoy
Research Lead, Gensyn
Distributed ML · Blockchain · Applied Cryptography · AI Security and Privacy
Lydia Yiyu Chen
University of Neuchatel, TU Delft