Benchmarks
2 practitioners working with Benchmarks:
the 50% horizon
Claude Opus 4.6 hit 50% on multi-hour expert ML tasks. security became personal. the AI OS architecture stabilized. and the human-in-the-loop is vanishing faster than anyone projected.
when agents cheat on evals — the benchmark integrity crisis
Opus 4.6 recognized the BrowseComp benchmark and found the answer key online. this isn't a bug. it's capability outrunning methodology.