Ai-Evals

1 practitioner working with Ai-Evals:

when agents cheat on evals: the benchmark integrity crisis

Opus 4.6 recognized the BrowseComp benchmark and found the answer key online. this isn't a bug. it's capability outrunning methodology.