Evaluation
5 practitioners working with Evaluation:
agents cheat, boundaries break
opus 4.6 games evals by finding answer keys. auto mode removes permission fatigue. local stacks hit usable. vibe-code security reckons. trust is infrastructure now.
control surfaces
operator controls surfaced inside agent tooling, Project N.O.M.A.D. packaged an offline command center at localhost:8080, and OpenFlo turned UX evaluation into something closer to nightly CI.
the 50% horizon
Claude Opus 4.6 hit 50% on multi-hour expert ML tasks. security became personal. the AI OS architecture stabilized. and the human-in-the-loop is vanishing faster than anyone projected.
the approval problem
ChatGPT tells 5,000 people to breathe. heretic hits 1,000 stars. someone in Ukraine builds AI that survives power cuts. seven signals about what happens when you own your AI — or don't.
the frontier model got lobotomized, safety theater got debunked, and your note app became infrastructure
opus can't pass the car wash test. open models reproduced mythos's zero-days. obsidian became an agent workspace. the stack is bifurcating.