Your README's first few lines, which are as far as you should expect most people to read, mention a 2-minute explainer video. But the video is actually 45 seconds. Why say otherwise? Hyperbole, maybe, but it makes me wonder whether any of this was QC'd by a human at all before publishing. If your headline marketing material is in question, I'm inclined to make assumptions about the rest of it as well.
Edit: I should add that I'm glad I didn't check out your website before commenting, because I probably would've been too intimidated to comment; my career and expertise wouldn't measure up to yours. I do stand by my thoughts, though. I think we often get so deep into our own domain and needs that we briefly lose sight of our average audience. I'll try this out myself on a website repo I'm updating and share how it went later on.
So I wrote a file that defined a composite metric (four weighted components → one score), an improvement loop, and constraints. Pointed Claude at it. Went to bed. Woke up to 12 commits and a score that had gone from 47 to 83.
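The loop shape, if it helps to see it: propose a change, re-score, keep it only if the composite went up, otherwise revert. A minimal sketch with made-up proposal names and scores (none of this is from the repo):

```shell
#!/bin/sh
# Hypothetical improvement loop: each proposal is scored, and only
# score-improving changes are kept. Numbers are illustrative.
best=47
for proposal in change_a change_b change_c; do
  # stand-in for "apply $proposal and run the scoring script"
  case $proposal in
    change_a) new=52 ;;
    change_b) new=50 ;;
    change_c) new=61 ;;
  esac
  if [ "$new" -gt "$best" ]; then
    best=$new            # keep: the change improved the composite
  fi                     # else: revert the change
done
echo "best: $best"
```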
The file became GOAL.md. The insight that surprised me: most software doesn't have a natural scalar metric like val_bpb. You have to construct it. Documentation quality, API trustworthiness, test infrastructure confidence — these things have no pytest --cov equivalent. But once you build the ruler, the autoresearch loop works on them too.
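To make "construct the metric" concrete, here's the shape of a four-component weighted composite. The component names and weights are invented for illustration, not taken from GOAL.md:

```shell
#!/bin/sh
# Hypothetical composite: four component scores (0-100) collapse into
# one scalar. Weights sum to 100 so the result stays on a 0-100 scale.
docs=60; api=70; tests=40; examples=50
score=$(( (docs*40 + api*25 + tests*25 + examples*10) / 100 ))
echo "composite: $score"
```

The point is just that once everything funnels into one number, a hill-climbing agent has something to climb.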
The part I’m most uncertain about: the “dual score” pattern. When the agent is building its own measuring tools, it can game the metric by weakening the instrument. So the docs-quality example has two scores — one for the docs, one for the linter itself. The agent has to improve the telescope before it can use it. I think this is load-bearing but I’d love to hear if others have found different solutions to the same problem.
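One way to wire the dual-score gate, sketched with invented numbers and a threshold I'm assuming rather than quoting: the docs score only counts once the linter's own self-test score clears a bar, so weakening the instrument zeroes the reward.

```shell
#!/bin/sh
# Hypothetical dual-score gate. Scores and threshold are illustrative.
linter_score=90   # % of the linter's own self-tests passing
docs_score=75     # what the linter reports for the docs
threshold=80
if [ "$linter_score" -ge "$threshold" ]; then
  result="docs: $docs_score (linter: $linter_score)"
else
  # A weakened instrument forfeits the docs score entirely.
  result="docs: 0 (linter below threshold: $linter_score)"
fi
echo "$result"
```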
Easiest way to try it: paste this into Claude Code, Cursor, or any coding agent and point it at one of your repos:
Read github.com/jmilinovich/goal-md — read the template and examples. Then write me a GOAL.md for this repo and start working on it.
Happy to hear what breaks. The scoring script is bash + jq so it’s not exactly production-grade, and the examples are biased toward the kinds of projects I work on. More examples from different domains would make the pattern sharper.