(1) Let the LLM randomly perturb the system.
(2) Measure the system's performance.
(3a) If the perturbation improved performance, keep the change.
(3b) Otherwise, don't.
(4) Repeat (a minimal sketch of this loop follows).
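In code, that's just a hill-climbing accept/reject loop. A minimal Python sketch, where propose_change and measure are hypothetical stand-ins for the LLM call and the synthesis/benchmark step (neither name comes from the post):

    # Minimal accept/reject loop for the steps above. propose_change() (the LLM
    # perturbation) and measure() (synthesis + benchmark) are hypothetical
    # stand-ins, not anything from the post.
    def optimize(design, measure, propose_change, steps=100):
        best_score = measure(design)
        for _ in range(steps):
            candidate = propose_change(design)   # (1) let the LLM perturb the system
            score = measure(candidate)           # (2) measure performance
            if score > best_score:               # (3a) improvement: keep the change
                design, best_score = candidate, score
            # (3b) otherwise discard the candidate; (4) loop repeats
        return design, best_score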
[1] https://github.com/karpathy/autoresearch
https://publicityreform.github.io/findbyimage/readings/lem.p...
Nice detail on the failures encountered. I've had very similar experiences with my own loops against test suites.
Great post. A snapshot in time.
> The agent did not know that would also halve the LUT count. It found out by doing it and watching the synthesizer.
So I guess this is an example of an LLM anthropomorphizing and making wild conjectures about the internal workings of a different LLM.
Pretty much what I did when letting Codex with gpt5.4xhigh improve a fairly complex CUDA kernel of mine, which resulted in a 20x throughput improvement.
There's a big difference between a working model that needs to be optimized and nothing working at all.
A fantastic opportunity to become the next next big thing and write a verifier verifier.
At the hypothesized inflexion point where AI instantly performs exactly as commanded, what happens to heavily regulated industries like medicine? Do we get huge leaps and bounds everywhere EXCEPT where it matters, or is regulation going to be handed over to a verifier verifier?
Um, yes? The big value that AMD had in the x86 market over competitors was their verification model. This has been known for decades.
> 3-seed nextpnr P&R on a Gowin GW2A-LV18 (Tang Nano 20K) — median Fmax × CoreMark iter/cycle = fitness
Every single "improvement" is basically about routing around how absolutely abysmally bad the Gowin FPGAs are. Kudos to that, I guess?
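For what it's worth, a rough Python sketch of how that quoted fitness could be computed: median Fmax over three P&R seeds times CoreMark iterations per cycle. The binary name, flags, and log regex are assumptions about a typical nextpnr setup, not details from the post:

    # Hypothetical sketch of the quoted fitness metric. The nextpnr invocation
    # and log format below are assumptions, not taken from the post.
    import re
    import statistics
    import subprocess

    def fmax_from_log(log_text):
        # nextpnr reports achieved clock frequencies in its log; the exact wording
        # differs between builds, so this pattern is only illustrative.
        freqs = [float(f) for f in re.findall(r"Max frequency.*?:\s*([\d.]+)\s*MHz", log_text)]
        return min(freqs) if freqs else 0.0  # Fmax is limited by the slowest reported clock

    def fitness(design_json, iters_per_cycle, seeds=(1, 2, 3)):
        fmaxes = []
        for seed in seeds:
            proc = subprocess.run(
                ["nextpnr-himbaechel", "--json", design_json, "--seed", str(seed)],
                capture_output=True, text=True,
            )
            fmaxes.append(fmax_from_log(proc.stdout + proc.stderr))
        return statistics.median(fmaxes) * iters_per_cycle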
Gowin FPGAs have extraordinarily bad carry-chain and block-to-block routing. It's so bad that a 32-bit ripple-carry adder is almost as fast as the carry-skip version even if you manually route it. The jump-prediction work is almost entirely about avoiding arithmetic computation altogether (something most other FPGAs would have no problem with).
Memory accesses are super slow and locked to clock edges rather than being level-sensitive (which is why ID/RF and WB take entire cycles and no amount of optimization could change that). The additions are all about routing around that (note the immutability of the ID and WB phases).
To top it off, the 5-stage pipeline is forced by an annoying quirk of the RISC-V architecture: the immediate offset on its load instruction. If the RISC-V load mandated 0 as the offset, the MEM read phase could overlap the EX phase since no ALU would be necessary (store doesn't care, because the result goes to memory rather than back to the register file, so RF writeback isn't an issue). The absolutely horrific add performance of the Gowin FPGAs makes this acute.
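To make the load-offset point concrete, a toy Python illustration (register and memory values are made up) of why the address add forces the ALU to run before the memory read:

    # Toy illustration of the load-offset point; register/memory values are made up.
    regs = {6: 0x1000}          # x6 holds a base pointer
    mem  = {0x1008: 42}         # word stored at base + 8

    # lw x5, 8(x6): the effective address depends on an immediate add, so the
    # ALU (EX) must run before the memory read (MEM) can start.
    rd, rs1, imm = 5, 6, 8
    ex_addr = regs[rs1] + imm   # EX phase: base + offset
    regs[rd] = mem[ex_addr]     # MEM phase reads, WB writes the register file

    # If the ISA forced imm == 0, the address would simply be regs[rs1] and the
    # memory read could overlap where the EX phase sits today.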
Finally, try putting this on a board. I found that anything above about 175 MHz out of Nextpnr failed to execute on actual hardware (please correct me if this isn't valid; it's been over a year since I tried Nextpnr on the SiPeed Tang Primer 20K). That's right around where a 32-bit add plus some routing sits on these FPGAs. There's something a bit off in Nextpnr's timing-analysis code, and the AI is almost certainly optimizing into it.
That having been said: I would LOVE somebody to bounce AI off of reversing the architecture and bitstreams for the stupid-ass closed-source FPGAs. Now THAT would be a project worth throwing a couple of grad students and a bunch of subsidized AI tokens at.