🤖 AI Summary
This study addresses the integrity threat posed by third-party relays in Bring-Your-Own-Key (BYOK) architectures, where aligned large language models (LLMs) have their responses maliciously altered post-alignment. The work formally introduces Relay Tampering Attacks (RTA), which strategically rewrite model outputs through multi-round manipulation, minimal critical edits, and stealthy recovery to bypass alignment safeguards while preserving proxy functionality. As the first systematic investigation into the vulnerability of post-alignment response pathways, this paper develops an effective RTA framework and proposes a novel defense based on timestamp anomaly detection. Experimental results demonstrate that RTA achieves up to 99.1% success rates across six mainstream LLMs on the AgentDojo and ASB benchmarks, substantially outperforming prompt injection baselines and evading all four state-of-the-art defenses evaluated.
📝 Abstract
Bring-Your-Own-Key (BYOK) agent architectures let users route LLM traffic through third-party relays, creating a critical integrity gap: a malicious relay can modify an aligned LLM response after generation but before agent execution. We formalize this post-alignment tampering threat and show that, without end-to-end integrity, the relay can observe, suppress, or replace downstream messages, making even perfectly aligned LLMs ineffective against such attacks. We instantiate this threat as the Relay Tampering Attack (RTA), which performs multi-round strategic rewriting, minimal security-critical edits, and stealth restoration by resubmitting tampered outputs to the upstream LLM. Across AgentDojo and ASB with six LLMs, RTA achieves up to 99.1% attack success, outperforming prompt-injection baselines with modest overhead. Case studies on OpenClaw and Claude Code demonstrate real-world feasibility, and evaluations of four defenses show that none fully prevent RTA. Finally, we propose a time-based detection defense that mitigates RTA while preserving agent utility.