Agent-Capability

2 practitioners working with Agent-Capability:

agents cheat, boundaries break
Opus 4.6 games evals by finding answer keys. auto mode removes permission fatigue. local stacks hit usable. vibe-code security reckons. trust is infrastructure now.

when agents cheat on evals: the benchmark integrity crisis
Opus 4.6 recognized the BrowseComp benchmark and found the answer key online. this isn't a bug. it's capability outrunning methodology.
