your AGENTS.md is a test suite or it's decorative
by Ray Svitla
the first peer-reviewed study of AGENTS.md files dropped this week. the researchers behind arXiv 2602.11988 ran a proper evaluation (different instruction types, different agents, measured outputs) and asked the obvious question nobody had bothered to test properly: do these files actually do anything?
yes. but the nuance is the whole point.
the finding nobody wanted
vague principles don’t work.
“prefer clean code” → no measurable effect.
“write descriptive variable names” → no measurable effect.
“test before committing” → minimal effect at best.
the rules that actually moved the needle were specific and failure-derived. meaning: someone wrote them after something broke. they captured a real failure state, its conditions, and the workaround. not an aspiration. a scar.
the researchers called it the difference between “normative” and “diagnostic” instructions. normative = what you’d like to be true. diagnostic = what you learned from something going wrong.
if your AGENTS.md looks like a coding philosophy statement, it’s decorative.
if it looks like a changelog with memory, it works.
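the contrast is easier to see side by side. a hypothetical AGENTS.md excerpt (the rules and paths are invented for illustration, not from the study):

```markdown
<!-- normative: reads nice, no measurable effect -->
- prefer clean code
- write descriptive variable names

<!-- diagnostic: written after something broke -->
- never run `npm run build` from the repo root; it silently picks up
  the stale lockfile in packages/legacy. cd into packages/web first.
  (broke CI twice before we traced it.)
```

the second rule names a command, a condition, a consequence, and a workaround. that's the scar.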
why this matters beyond coding agents
the study was about coding agents. but the finding applies to any context file you’re giving an AI — AGENTS.md, CLAUDE.md, system prompts, memory files. all of it.
your AI’s context layer is only as useful as the failures that wrote it.
there’s a version of building a personal AI OS that’s basically self-improvement journaling but for robots. you write out your principles: “I value clarity. I prefer brevity. I like when you ask clarifying questions before acting.” and you expect the system to become smarter about you.
what actually happens: the AI reads those principles, acknowledges them in context, and then proceeds to do whatever it was going to do anyway. not because it’s ignoring you. because general principles are too fuzzy to constrain behavior.
the version that works: you catch the AI doing something wrong, understand why, and write a specific rule that prevents that exact failure from recurring. then you test it. then you adjust.
that’s a test suite. not a readme.
the uncomfortable parallel
engineers use this already. they just don’t think of AGENTS.md the same way.
when something breaks in production, you write a test. not a note that says “prefer things that don’t break.” a specific assertion: given input X, expect output Y, if not — fail loudly.
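in code, that discipline looks like a regression test. a minimal sketch (the function and the failure scenario are invented for illustration):

```python
# hypothetical regression test: it freezes a real failure, not an aspiration.
# suppose parse_price once choked on "  $1,299.00 " and broke checkout.

def parse_price(text: str) -> float:
    """Parse a price string like ' $1,299.00 ' into a float."""
    cleaned = text.strip().lstrip("$").replace(",", "")
    return float(cleaned)

def test_parse_price_handles_whitespace_and_commas():
    # the exact input that failed in production, pinned as an assertion
    assert parse_price("  $1,299.00 ") == 1299.0

test_parse_price_handles_whitespace_and_commas()
```

nothing in that test says "prefer robust parsing." it says: this input, this output, or fail loudly.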
the discipline of good testing is the discipline of good context engineering. specific. verifiable. grounded in real failure.
most AGENTS.md files I’ve seen (and the one I used to maintain) were written in a good mood, with good intentions, trying to express an ideal. they were optimistic documents. the AI equivalent of New Year’s resolutions.
the study is saying: the files that actually help are written in a bad mood, after something broke, with enough precision to prevent it from breaking the same way again.
what to do with this
a few things actually change if you take this seriously.
first: stop writing aspirationally. if you’re adding instructions to your AI’s context that you haven’t tested and verified improve output, stop. you’re adding noise that may dilute the instructions that actually work.
second: start a failure log. every time your AI does something wrong, note it. not in the vague sense — specific inputs, specific failure, specific correction. your AGENTS.md should be a commit history of your AI’s learning.
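an entry in that log can be short. one possible shape (the incident and format are invented; adapt to taste):

```markdown
## agent deleted untracked migration files
- input: "clean up the migrations folder"
- failure: ran `git clean -fd`, removed local-only migration drafts
- correction: never run `git clean` without `-n` (dry run) first;
  list what would be deleted and confirm before removing anything untracked
```

input, failure, correction. if you can't fill in all three, you don't have a rule yet.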
third: treat context like tests. when you add a new rule, run the scenario that would have violated the old behavior. did adding the rule prevent the failure? if yes, keep it. if not, it’s aspirational and it’s not earning its space.
fourth: prune aggressively. rules that came from real failures should stay. principles that came from a good afternoon writing session should go. the file should get smaller as it gets better, not longer.
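the third step, treating context like tests, can itself be sketched as code. everything below is hypothetical scaffolding: `run_agent` stands in for however you actually invoke your agent, and the stub hard-codes the behavior so the loop is runnable.

```python
# minimal sketch of the rule-verification loop: a rule earns its space
# only if it demonstrably flips a known failure into a success.

def run_agent(prompt: str, context: str) -> str:
    # stub: pretend the agent only avoids the failure when the rule is present.
    # in real use, this would call your coding agent with the context file.
    if "cd into packages/web" in context:
        return "cd packages/web && npm run build"
    return "npm run build"  # the old failing behavior

def rule_earns_its_space(rule: str, failing_prompt: str, bad_output: str) -> bool:
    """Keep a rule only if the failure occurs without it and not with it."""
    without_rule = run_agent(failing_prompt, context="")
    with_rule = run_agent(failing_prompt, context=rule)
    return without_rule == bad_output and with_rule != bad_output

keep = rule_earns_its_space(
    rule="never build from the repo root; cd into packages/web first.",
    failing_prompt="build the web package",
    bad_output="npm run build",
)
print(keep)  # True with this stub
```

with a real agent the check is noisier (run the scenario several times, not once), but the shape holds: no flip, no rule.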
the connection to personal AI OS
this is the part that interests me most about the study’s implications.
if your AI is a personal operating system — something that runs persistently, builds context over time, handles tasks autonomously — then the context files it uses are its configuration. and configuration that doesn’t reflect real usage is just dead weight.
every piece of software you use long enough eventually accumulates cruft: settings you changed once, workflows you stopped using, plugins from a phase you’ve left behind. the same thing happens to AI context. it accumulates. it gets verbose. and at some point the signal disappears into the noise.
the test-suite framing gives you a way to manage this. keep what you can prove. discard what you can only assert. run the system against its own stated rules.
it’s not glamorous. it sounds more like maintenance than design. but it’s the difference between an AI that gets smarter as you use it and one that just gets more opinionated.
one thing to check this week
open your AGENTS.md (or CLAUDE.md or whatever you’re calling it). read the last instruction you added.
can you recall the specific failure that generated it? if yes, it belongs there. if you added it because it sounded right, consider whether it’s earning its space.
your AI isn’t reading your values. it’s reading your failures. give it the right ones.
→ the paper: arxiv.org/abs/2602.11988
Ray Svitla
stay evolving 🐌