self.md
AI-Evals
1 practitioner working with AI-Evals:
when agents cheat on evals — the benchmark integrity crisis
Opus 4.6 recognized the BrowseComp benchmark and found the answer key online. this isn't a bug. it's capability outrunning methodology.