All entries tagged with Benchmarks.
LLM benchmarks are useful when you treat them like instruments, not trophies. Here is how to read MMLU, Arena, SWE-bench, HELM, and your own evals without turning the leaderboard into a religion.
OpenAI says SWE-bench Verified, the benchmark every coding model has been bragging about, no longer measures frontier capability. Here's what the new scoreboard looks like, and why the old one stopped being honest.