control surfaces

agent-harness governance personal-ai

control surfaces


self.md radar — 2026-04-25

The day’s AI surface moved one layer up from the model: the harness shaping behavior, the audit logic deciding whether a decision is defensible, and the notes substrate giving agents structure to act on.

A live Anthropic postmortem names the exact knobs that broke Claude Code, a fresh arXiv paper argues rule-governed AI should be judged on defensibility instead of label-matching, and Atomic ships a local-first PKM with daily briefings, MCP, and agent chat as product surfaces.

1. Claude Code’s regression lived in the harness, not the weights

sources:

what happened: Anthropic traced recent Claude Code complaints to a stack of operating-layer changes, not a model swap. Defaulting reasoning effort to high made the product think too long, feel frozen, and burn extra latency and tokens. A March 26 caching optimization could drop prior reasoning after prompt-cache eviction, and a system-prompt change meant to reduce verbosity also hurt intelligence. The remediation list is process-shaped: tighter public-build dogfooding, code review on prompt changes, broader evals and ablations, soak periods, and gradual rollouts.

why this matters: Users felt “the model got worse” but the diff was in reasoning defaults, cache behavior, and prompt edits. If the harness is the product, harness changes need release discipline equal to weight changes.

2. Defensibility beats agreement for rule-governed AI

sources:

what happened: The paper evaluates 193,000+ Reddit moderation decisions and finds a 33 to 46.6 percentage-point gap between agreement metrics and policy-grounded correctness. It argues 79.8 to 80.6 percent of apparent false negatives were actually defensible policy-grounded decisions that just disagreed with the historical label. A proposed Governance Gate built on defensibility signals reaches 78.6 percent automation coverage with 64.9 percent risk reduction. The frame: stop asking “did the model match the human?” and start asking “can this decision be defended under the rules?”

why this matters: While vendors keep tuning prompts and harnesses underneath, operators need an eval shape that survives those shifts. Defensibility against a written rule set is more stable than agreement with a noisy label history.

3. Atomic makes the notes layer agent-ready

sources:

what happened: Atomic’s creator says the last month shipped a rebuilt iOS app, Android on the way, an expanded MCP and internal agent-chat toolkit, a custom CodeMirror 6 markdown editor with Obsidian-style rendering, and a dashboard with a daily summary of atoms created or updated. The product is positioned as a local-first, AI-augmented personal knowledge base and knowledge graph, with semantic search, wiki synthesis, agentic chat, auto-tagging, a spatial canvas, and MCP integration. The launch thread emphasizes self-hosting a server that web, mobile, and desktop clients connect to. Ingestion runs through RSS, web clipper, mobile share capture, Obsidian sync, and a REST API.

why this matters: Notes apps are quietly becoming agent substrates: structured memory, synthesis, and tool hooks are turning into first-class product surfaces, not plugins.

left on the table