Claude Code Wasn't Nerfed. It Was Sick — Here's the Diagnosis

If you've been using Claude Code in the last month and felt like it was getting dumber, you weren't imagining it. And you weren't wrong.

Anthropic just published a postmortem admitting that three separate bugs degraded Claude Code between early March and late April. Zero of them were in the underlying model. All three were in the scaffolding around it — the harness, the prompt, the caching layer.

This is probably the most honest bug report a frontier AI lab has published. Let's walk through it.

The short version

Users across Reddit, X, and GitHub had been saying the same thing for weeks: Claude Code felt less sharp, more forgetful, weirdly terse. Some called it "AI shrinkflation." Stella Laurenzo from AMD even ran an audit on 6,852 session files showing measurable regression.

Anthropic investigated, found three unrelated bugs stacked on top of each other, shipped fixes, and reset everyone's usage limits as a mea culpa. Here's the anatomy:

#	Bug	Active	Fixed
1	Reasoning effort quietly downgraded from `high` to `medium`	Mar 4	Apr 7
2	Thinking cache cleared every turn instead of once	Mar 26	Apr 10
3	System prompt capped prose at ≤25 words, hurting coding	Apr 16	Apr 20

Three bugs. One overlapping month. No wonder people noticed.

Bug #1: The intentional downgrade that was the wrong call

On March 4, someone at Anthropic decided to change Claude Code's default reasoning_effort from high to medium. The logic was reasonable on paper: latency matters, and medium effort is faster. The tradeoff is a small drop in quality for a meaningful speedup.

Users hated it. Turns out when you're running a coding agent for an hour, waiting 30% longer for 10% smarter is a deal most devs will take every single time. Anthropic admits in the postmortem: "this was the wrong tradeoff." Reverted on April 7.

Lesson here isn't that they made a bad call — it's how long it took to roll it back. A month of paying customers grumbling before the "maybe our default is wrong" signal landed internally.

Bug #2: The forgetful cache (this one is genuinely subtle)

This is the one that made Claude feel like it had amnesia mid-session.

Prompt caching for long-running agent sessions is tricky. Claude's reasoning traces — the internal "thinking" tokens — can pile up over dozens of tool calls. If a session sits idle for an hour, most of that thinking is probably stale. So on March 26, Anthropic shipped an optimization: after an idle gap, clear the older thinking tokens from the cache to keep things lean.

The bug: the clear wasn't supposed to happen once per idle gap. It was supposed to happen once, then leave the rest of the session alone. Instead, it triggered every single turn for the rest of the session.

Imagine you're pair-programming with someone and every time you ask a follow-up question, they forget everything you just discussed in the last five minutes. That's what was happening. Claude was erasing its own reasoning chain on every turn, which meant it kept re-deriving conclusions it had already reached — or worse, reaching different ones.

The fix landed April 10. If you had long-running sessions that felt stale and repetitive between March 26 and April 10, this is why. You weren't crazy. The bot was literally forgetting.

Bug #3: The ≤25-word rule that made Claude a worse coder

This one is the most interesting to me as a design lesson.

On April 16, Anthropic added an instruction to Claude Code's system prompt: keep prose between tool calls to ≤25 words, keep final responses to ≤100 words. The intent was to cut verbose, wasteful token output — a legitimate complaint from users who'd been burned by big token bills.

Measured result: a ~3% drop in coding quality on both Opus 4.6 and 4.7. Reverted April 20.

Think about what happened. You told a model whose entire reasoning lives in its own output tokens that it had a word budget. It complied. It wrote less. But "write less" and "think less" are the same instruction when you're thinking through writing.

"Keep it to 25 words" became "think less, then act" — and agents that think less make worse plans.

This is the part that should make people building agent harnesses pause. Small prompt tweaks can look free — they don't change the model, they don't change the training data, they're a one-line diff. But they can move benchmark numbers in ways that only show up across thousands of real sessions.

The meta-observation: the model wasn't the problem

Zoom out. All three of these bugs were in the harness — the infrastructure that takes a raw model and turns it into a product. The model itself, Opus 4.7, was unchanged the whole time.

A modern AI product is:

The model — weights, training, architecture
The harness — system prompt, tool definitions, context management, caching, reasoning budget
The evals — tests that tell you the combined system is working

What this postmortem reveals is that (2) has gotten complicated enough that it deserves the same engineering rigor as (1). The system prompt is code. Cache eviction policy is code. Reasoning effort defaults are code. And like all code, it can regress quietly if you're not testing the right things.

Anthropic's evals didn't catch any of this. They ran the full suite, it passed, the bugs shipped. The evals were running on short synthetic tasks — not 3-hour-long sessions with 200 tool calls where cache eviction actually matters. Their postmortem is candid about this: internal testing "initially failed to reproduce the issues."

Why I care about this postmortem existing at all

Here's the part that actually matters beyond the specific bugs: Anthropic wrote this down and published it.

Imagine asking your favorite other AI lab: "Hey, why did your product feel worse for a month in March?" You get back "we're always iterating," or "user perception varies," or just silence. You definitely don't get three specific commit windows and a confession that your evals missed a regression.

Public engineering postmortems are a norm in infrastructure — Cloudflare does them, AWS does them, GitHub does them. They're not yet a norm in AI products, where "we updated the model" is usually the most detail you get.

I hope this becomes more common. The users-were-right narrative — "I could tell it got worse, they denied it, eventually they admitted it" — is bad for trust every time it plays out. The opposite narrative — "users reported it, we investigated, here's what we found" — is how you get people to believe you when the next complaint is actually model drift rather than a harness bug.

My take

Anthropic screwed up the tradeoffs, shipped two real bugs, and took too long to catch them. That's real. The users who complained for a month were right, and they deserved the resolution sooner.

But they also did the thing that almost nobody else does in this industry: they named every bug, dated every fix, explained what their evals missed, and didn't blame users for imagining things. The shrinkflation discourse about Claude Code wasn't paranoia — it was pattern-matching, and the pattern was real.

If you build agent products, read this postmortem twice. The bugs are specific, but the lesson is general: your harness is code, and it will regress like code. Treat prompt changes and cache logic with the same gravity you treat model changes. Test with session lengths that match real usage. And when users tell you something feels off for a month, maybe believe them before the public audit drops.

Sources

An update on recent Claude Code quality reports — Anthropic's official postmortem. Reads the three bugs, dates, and root causes in their own words.
Simon Willison on the postmortem — independent take, particularly on the idle-session cache bug
Mystery solved: Anthropic reveals changes to Claude's harnesses — VentureBeat's summary of the community reaction and the "AI shrinkflation" framing
Anthropic denies intentional slowdown of Claude Code — context on the conspiracy theories the postmortem was responding to