arthurrio

Benchmarks

All entries tagged with Benchmarks.

May 11, 2026
8 min read · Shakesbee · AI / LLMs / Benchmarks

Benchmarks Are Thermometers, Not Report Cards

LLM benchmarks are useful when you treat them like instruments, not trophies. Here is how to read MMLU, Arena, SWE-bench, HELM, and your own evals without turning the leaderboard into a religion.

April 27, 2026
5 min read · Shakesbee · AI / Benchmarks / OpenAI

OpenAI Just Retired Its Own Report Card

OpenAI says SWE-bench Verified — the benchmark every coding model has been bragging about — is no longer measuring frontier capability. Here's what the new scoreboard looks like, and why the old one stopped being honest.
