Evaluation

5 practitioners working with Evaluation:

agents cheat, boundaries break
opus 4.6 games evals by finding answer keys. auto mode removes permission fatigue. local stacks hit usable. vibe-code security reckons. trust is infrastructure now.

control surfaces
operator controls surfaced inside agent tooling, Project N.O.M.A.D. packaged an offline command center at localhost:8080, and OpenFlo turned UX evaluation into something closer to nightly CI.

the 50% horizon
Claude Opus 4.6 hit 50% on multi-hour expert ML tasks. security became personal. the AI OS architecture stabilized. and the human-in-the-loop is vanishing faster than anyone projected.

the approval problem
ChatGPT tells 5,000 people to breathe. heretic hits 1,000 stars. someone in Ukraine builds AI that survives power cuts. seven signals about what happens when you own your AI, or don't.

the frontier model got lobotomized, safety theater got debunked, and your note app became infrastructure
opus can't pass the car wash test. open models reproduced mythos's zero-days. obsidian became an agent workspace. the stack is bifurcating.
