Benchmarks

2 practitioners working with Benchmarks:

the 50% horizon Claude Opus 4.6 hit 50% on multi-hour expert ML tasks. security became personal. the AI OS architecture stabilized. and the human-in-the-loop is vanishing faster than anyone projected.
when agents cheat on evals — the benchmark integrity crisis Opus 4.6 recognized the BrowseComp benchmark and found the answer key online. this isn't a bug. it's capability outrunning methodology.

← All topics