The Agent Didn't Delete Your File. It Sanded It Down.
A new benchmark, DELEGATE-52, reports that long AI editing sessions quietly corrupt documents. The useful lesson is not 'never delegate'; it is 'make every edit inspectable.'
So here's a paper title that sounds like a bug report from the future: LLMs Corrupt Your Documents When You Delegate.
Not "LLMs sometimes hallucinate." We know that one. Not "LLMs write weird prose." Also known.
This one is more annoying. The claim is that when you hand an AI a real document and ask it to keep editing over a long workflow, the model can quietly damage the artifact while doing the requested work. A missing field here. A changed number there. A broken reference in the corner. Not a dramatic explosion. More like sanding a table until one leg is shorter.
That matters because "delegate the whole thing to the agent" is exactly where every product demo is pointing.
What DELEGATE-52 tests
The researchers built a benchmark called DELEGATE-52. The shape is simple:
| Part | What it means |
|---|---|
| 52 domains | Real document formats: code, accounting ledgers, calendars, music notation, crystallography files, subtitles, and more |
| Long workflows | The model performs repeated edits, not one isolated transformation |
| Document scoring | Each domain has evaluators that check whether the document still preserves the expected content |
| 19 models | The experiment spans multiple model families, including frontier models |
The finding that grabbed everyone: by the end of long workflows, even frontier models corrupted about a quarter of document content on average in the paper's setup. Weaker models were worse.
That number is scary, but the more interesting detail is the failure mode. The errors were described as sparse but severe. In normal human language: the model does not necessarily ruin the whole file. It makes a few changes that are easy to miss and hard to forgive.
This is why the benchmark feels relevant beyond academia. Most people do not inspect every character after asking an assistant to "clean this up" or "reorganize this file." They check whether the output looks plausible. Plausible is exactly where silent corruption hides.
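To make "sparse but severe" concrete, here is a small illustrative sketch. It is not the benchmark's scoring method: the 50-line ledger and the single changed amount are invented for this example. It uses Python's standard `difflib` to show that an edited file can match the original almost line for line and still carry one damaging change.

```python
import difflib

# Hypothetical 50-line ledger, invented for this illustration.
original = [f"invoice-{n:03d}: {n * 10}.00" for n in range(1, 51)]

# The "edited" copy differs in exactly one line: two digits swapped in one amount.
edited = list(original)
edited[17] = "invoice-018: 810.00"  # was 180.00

matcher = difflib.SequenceMatcher(a=original, b=edited, autojunk=False)
similarity = matcher.ratio()  # line-level similarity between 0 and 1

# Collect only the spans that are not identical in both versions.
changed = [
    (original[i1:i2], edited[j1:j2])
    for tag, i1, i2, j1, j2 in matcher.get_opcodes()
    if tag != "equal"
]

print(f"line-level similarity: {similarity:.2f}")
print(f"changed spans: {changed}")
```

A similarity score like this is exactly the kind of signal that says "looks fine." The list of changed spans is the part worth reading.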
Delegation is not generation
The post-ChatGPT era trained us to ask: "Can the model produce a good answer?"
Delegation asks a different question: "Can the model preserve what already exists while making only the right changes?"
Those are not the same skill.
Generation is like asking someone to cook dinner from scratch. Delegation is asking them to renovate your kitchen while keeping the plumbing, wiring, permits, receipts, and family calendar intact. The second job has more ways to fail quietly.
That is why this paper lands at the right moment. We have spent a year pushing AI from chat into work surfaces:
- coding agents that edit repositories,
- office copilots that update documents,
- research assistants that manipulate notes,
- workflow agents that touch spreadsheets, tickets, PDFs, calendars, and email.
All of those systems need a boring superpower: do not damage the thing you were asked to help with.
The fair objection
There is a fair technical pushback here.
Some of the Hacker News discussion focused on the benchmark's tool harness. The agentic setup was not a highly optimized production coding agent with surgical edit tools, diff previews, typed transformations, tests, linters, rollback, and review gates. Simon Willison argued that a better harness could likely do better.
I buy that.
But I do not think it makes the result useless. It changes what lesson we should take from it.
The weak conclusion is: "Models are bad, never delegate."
The stronger conclusion is: raw model delegation is not a product architecture.
If your workflow is "give the model the whole file, ask for edits, accept the rewritten file," you are asking for document drift. A good agent product should behave less like a novelist rewriting the whole chapter and more like a careful editor with track changes on.
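As a rough sketch of what "track changes on" can look like, the snippet below wraps Python's standard `difflib`. The function names `unified_review` and `lines_touched`, the sample document, and the "more than one line" threshold are assumptions made for illustration, not part of the paper or of any particular product.

```python
import difflib


def unified_review(before: str, after: str, name: str = "document") -> str:
    """Build a unified diff a human can read before accepting an edit."""
    return "".join(
        difflib.unified_diff(
            before.splitlines(keepends=True),
            after.splitlines(keepends=True),
            fromfile=f"{name} (before)",
            tofile=f"{name} (after)",
        )
    )


def lines_touched(before: str, after: str) -> int:
    """Count original lines that the edit changed or removed."""
    matcher = difflib.SequenceMatcher(
        a=before.splitlines(), b=after.splitlines(), autojunk=False
    )
    return sum(
        i2 - i1
        for tag, i1, i2, _j1, _j2 in matcher.get_opcodes()
        if tag in ("replace", "delete")
    )


# Hypothetical case: the request was only "capitalise the title".
before = "title: q3 report\ntotal: 1200\nowner: dana\n"
after = "title: Q3 Report\ntotal: 1280\nowner: dana\n"

print(unified_review(before, after))
if lines_touched(before, after) > 1:
    print("More lines changed than the request needed; review before accepting.")
```

The point of the second function is scope: a one-line request that touches two lines is a cheap, mechanical hint that something drifted.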
The practical lesson
If you are building or buying AI tools, the question is not only "which model is smartest?" Ask about the edit surface:
| Bad smell | Better pattern |
|---|---|
| Whole-document rewrites for small changes | Surgical patches or structured operations |
| No visible diff | Mandatory before/after review |
| No domain validation | Parsers, tests, schemas, linters, or semantic checks |
| No rollback | Version history and restore points |
| One long context soup | Smaller files, explicit references, scoped tasks |
| "Trust me" automation | Human approval for high-value artifacts |
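The first row of the table, structured operations instead of whole-document rewrites, can be sketched in a few lines. This is a minimal illustration under assumptions: the `SetField` operation, the `apply_ops` helper, and the invoice data are all hypothetical. The idea is that the agent emits small operations that name one field and the value it expects to find there, so everything it did not name stays untouched by construction.

```python
import copy
from dataclasses import dataclass
from typing import Any, Dict, List, Tuple


@dataclass(frozen=True)
class SetField:
    """One scoped edit: change a single field, only if it still holds `expected`."""
    path: Tuple[str, ...]
    expected: Any
    new: Any


def apply_ops(doc: Dict[str, Any], ops: List[SetField]) -> Dict[str, Any]:
    """Apply operations to a copy of `doc`; fields not named in an op are not modified."""
    result = copy.deepcopy(doc)
    for op in ops:
        node = result
        for key in op.path[:-1]:
            node = node[key]
        leaf = op.path[-1]
        if node[leaf] != op.expected:
            # The agent's view of the document is stale or wrong: refuse the edit.
            raise ValueError(
                f"{'.'.join(op.path)}: expected {op.expected!r}, found {node[leaf]!r}"
            )
        node[leaf] = op.new
    return result


# Hypothetical document and a single requested change.
invoice = {"invoice": {"id": "A-17", "total": 1200, "currency": "EUR"}}
updated = apply_ops(invoice, [SetField(("invoice", "currency"), "EUR", "USD")])
print(updated)
```

The `expected` check is a design choice rather than a requirement: it makes the operation fail loudly when the model's picture of the file no longer matches the file.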
This is especially true outside code. Software has a cheat code: tests, compilers, type systems, git diffs. A legal memo, a music score, a financial ledger, or a slide deck often has fewer automatic alarms. The file can be wrong and still look polished.
That is the dangerous zone.
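Formats without a compiler can still get an alarm if you write down what must not change. The sketch below assumes a CSV ledger with `id` and `amount` columns (both column names are assumptions, as is the sample data) and checks two things after an edit that was only supposed to reorganise the file: no entries disappeared or appeared, and the total is unchanged.

```python
import csv
import io
from decimal import Decimal
from typing import List, Set, Tuple


def ledger_summary(csv_text: str) -> Tuple[Set[str], Decimal]:
    """Return the set of entry ids and the exact sum of amounts."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    ids = {row["id"] for row in rows}
    total = sum((Decimal(row["amount"]) for row in rows), Decimal("0"))
    return ids, total


def preservation_problems(before: str, after: str) -> List[str]:
    """List ways in which `after` fails to preserve the entries of `before`."""
    before_ids, before_total = ledger_summary(before)
    after_ids, after_total = ledger_summary(after)
    problems = []
    if before_ids - after_ids:
        problems.append(f"missing entries: {sorted(before_ids - after_ids)}")
    if after_ids - before_ids:
        problems.append(f"unexpected entries: {sorted(after_ids - before_ids)}")
    if before_total != after_total:
        problems.append(f"total changed from {before_total} to {after_total}")
    return problems


# Hypothetical ledger and two candidate "reorganised" versions.
before = "id,date,amount\n1,2024-01-05,100.00\n2,2024-01-09,-40.50\n3,2024-01-12,19.99\n"
reordered = "id,date,amount\n3,2024-01-12,19.99\n1,2024-01-05,100.00\n2,2024-01-09,-40.50\n"
damaged = "id,date,amount\n1,2024-01-05,100.00\n3,2024-01-12,19.90\n"

print(preservation_problems(before, reordered))
print(preservation_problems(before, damaged))
```

This catches only what it was told to watch. Two errors that cancel out in the total would slip through, which is why such checks complement a diff instead of replacing it.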
My take
DELEGATE-52 is not proof that agents are doomed. It is evidence that "the model is smart" is not enough.
The agent era needs infrastructure that treats preservation as a first-class requirement. Diff everything. Validate formats. Keep old versions. Prefer commands over rewrites. Make the agent explain what changed. Assume long workflows accumulate dust unless something is sweeping.
The funny part is that developers already learned this lesson the painful way. We use version control because humans corrupt documents too. We use tests because confidence is not evidence. We review diffs because "looks fine" is how bugs enter production wearing a clean shirt.
AI does not remove those habits. It makes them more important.
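Two of those habits, validating formats and keeping old versions, fit in one small wrapper. This is a sketch under assumptions, not a production tool: `guarded_edit` is a hypothetical helper, the `.bak` naming is arbitrary, the validator is assumed to raise `ValueError` on a malformed document (as `json.loads` does), and the lambda stands in for whatever the agent returns.

```python
import json
import tempfile
from pathlib import Path
from typing import Callable


def guarded_edit(
    path: Path,
    edit: Callable[[str], str],
    validate: Callable[[str], object],
) -> bool:
    """Write the edited text only if it passes `validate`.

    A backup copy is written first, so the previous version stays restorable.
    Returns True if the edit was accepted, False if it was rejected.
    """
    original = path.read_text(encoding="utf-8")
    backup = path.with_name(path.name + ".bak")
    backup.write_text(original, encoding="utf-8")  # restore point

    candidate = edit(original)
    try:
        validate(candidate)  # assumed to raise ValueError on a malformed document
    except ValueError:
        return False  # rejected: the file on disk is left exactly as it was

    path.write_text(candidate, encoding="utf-8")
    return True


with tempfile.TemporaryDirectory() as tmp:
    doc = Path(tmp) / "settings.json"
    doc.write_text('{"retries": 3}', encoding="utf-8")
    # Stand-in for an agent that truncated the file mid-edit.
    accepted = guarded_edit(doc, lambda text: '{"retries": 3,', json.loads)
    print(accepted, doc.read_text(encoding="utf-8"))
```

A format check like this only proves the file still parses. It says nothing about whether the values are right, so it belongs next to a diff review, not in place of one.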
The "amigo nerd" verdict: delegate drafts, experiments, and low-risk cleanup freely. For anything valuable, make the agent work through a diff. If the tool cannot show you exactly what changed, it is not an assistant yet. It is a blender with a save button.
Sources
- LLMs Corrupt Your Documents When You Delegate — the arXiv paper introducing DELEGATE-52 and reporting the long-workflow corruption results
- microsoft/delegate52 dataset — public release of the benchmark dataset, including work environments and domains
- microsoft/delegate52 code repository — accompanying code for running relay simulations and inspecting the benchmark harness
- Hacker News discussion — useful technical pushback on the benchmark harness and what production-grade editing tools might change