LLMs Corrupt Your Documents When You Delegate

📅 2026-04-16

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses silent content corruption introduced by large language models (LLMs) during long-horizon, delegated document editing. To evaluate reliability in such workflows, the authors introduce DELEGATE-52, a benchmark that simulates long delegated workflows across 52 professional domains. An evaluation of 19 LLMs shows that even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, and other models fail more severely. Agentic tool use does not improve performance, and degradation worsens with document size, interaction length, and the presence of distractor files. The errors are sparse but severe, and they compound silently over long interactions.

📝 Abstract

Large Language Models (LLMs) are poised to disrupt knowledge work, with the emergence of delegated work as a new interaction paradigm (e.g., vibe coding). Delegation requires trust: the expectation that the LLM will faithfully execute the task without introducing errors into documents. We introduce DELEGATE-52 to study the readiness of AI systems in delegated workflows. DELEGATE-52 simulates long delegated workflows that require in-depth document editing across 52 professional domains, such as coding, crystallography, and music notation. Our large-scale experiment with 19 LLMs reveals that current models degrade documents during delegation: even frontier models (Gemini 3.1 Pro, Claude 4.6 Opus, GPT 5.4) corrupt an average of 25% of document content by the end of long workflows, with other models failing more severely. Additional experiments reveal that agentic tool use does not improve performance on DELEGATE-52, and that degradation severity is exacerbated by document size, length of interaction, or presence of distractor files. Our analysis shows that current LLMs are unreliable delegates: they introduce sparse but severe errors that silently corrupt documents, compounding over long interactions.
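To make the idea of "corrupted document content" concrete, here is a minimal Python sketch of one way such a fraction could be estimated: compare the document before and after a delegated workflow and count the original lines that do not survive unchanged. This is an illustrative assumption, not the metric DELEGATE-52 uses; the function name `corrupted_fraction` and the line-level diff are hypothetical choices made for this example only.

```python
import difflib


def corrupted_fraction(original: str, final: str) -> float:
    """Fraction of original lines that do not survive unchanged in `final`.

    Hypothetical line-level proxy for document corruption; the benchmark's
    actual scoring method is not described in the abstract.
    """
    orig_lines = original.splitlines()
    final_lines = final.splitlines()
    if not orig_lines:
        return 0.0
    # autojunk=False keeps the matcher from discarding frequent lines as junk.
    matcher = difflib.SequenceMatcher(a=orig_lines, b=final_lines, autojunk=False)
    preserved = sum(block.size for block in matcher.get_matching_blocks())
    return 1.0 - preserved / len(orig_lines)


# Toy document: one of four lines is silently altered by an edit.
before = "title: Report\nvalue: 10\nunit: mm\nauthor: A. Smith"
after = "title: Report\nvalue: 10\nunit: cm\nauthor: A. Smith"
print(corrupted_fraction(before, after))  # 0.25
```

A line-level diff is coarse: it treats a one-character change the same as a fully rewritten line, and it says nothing about whether a change was requested. A real evaluation would need to separate intended edits from unintended damage, which this sketch does not attempt.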

Problem

Research questions and friction points this paper is trying to address.

Large Language Models

delegated workflows

document corruption

error propagation

AI reliability

Innovation

Methods, ideas, or system contributions that make the work stand out.

delegated workflows

document corruption

LLM reliability