
Hacker News

Scaling long-running autonomous coding

274 points by samwillis

by simonw

12 subcomments

"To test this system, we pointed it at an ambitious goal: building a web browser from scratch."
I shared my LLM predictions last week, and one of them was that by 2029 "Someone will build a new browser using mainly AI-assisted coding and it won’t even be a surprise" https://simonwillison.net/2026/Jan/8/llm-predictions-for-202... and https://www.youtube.com/watch?v=lVDhQMiAbR8&t=3913s
This project from Cursor is the second attempt I've seen at this now! The other is this one: https://www.reddit.com/r/Anthropic/comments/1q4xfm0/over_chr...

by embedding-shape

4 subcomments

Did anyone manage to run the tests from the repository itself? The code seems filled with errors and warnings, and as far as I can tell none of them are caused by the platform I'm on (Linux). I went and looked at the Actions workflow history for some pages, and it seems CI has been failing for a while; the PRs have all been failing CI too, but were merged anyway. How exactly was this verified as something to be used as a successful example, or am I misunderstanding what point they are trying to make? They mention a screenshot, but they never actually mention whether their goal was successfully met, do they?
I'm not sure the approach of "completely autonomous coding" is the right way to go. I feel like we'll be able to use these agents more effectively if we think of them as something a human uses to accomplish a task, and lean into letting the human drive, because quality spirals out of control so quickly.

by trjordan

6 subcomments

This is going to sound sarcastic, but I mean this fully: why haven't they merged that PR?
The implied future here is _unreal cool_. Swarms of coding agents that can build anything, with little oversight. Long-running projects that converge on high-quality, complex projects.
But the examples feel thin. Web browsers, Excel, and Windows 7 exist, and they specifically exist in the LLM's training sets. The closest to real code is what they've done with Cursor's codebase... but it's not merged yet.
I don't want to say "call me when it's merged." But I'm not worried about agents' ability to produce millions of lines of code. I'm worried about their ability to intersect with the humans in the real world, both as users of that code and developers who want to build on top of it.

by ZitchDog

0 subcomment

I used similar techniques to build tjs [1] - the world's fastest and most accurate JSON Schema validator, with magical TypeScript types. I learned a lot about autonomous programming. I found a similar "planner/delegate" pattern to work really well, with the use of git subtrees to fan out work [2].
I think any large piece of software with well established standards and test suites will be able to be quickly rewritten and optimized by coding agents.
[1] https://github.com/sberan/tjs
[2] /spawn-perf-agents claude command: https://github.com/sberan/tjs/blob/main/.claude/commands/spa...

by micimize

2 subcomments

> While it might seem like a simple screenshot, building a browser from scratch is extremely difficult.
> Another experiment was doing an in-place migration of Solid to React in the Cursor codebase. It took over 3 weeks with +266K/-193K edits. As we've started to test the changes, we do believe it's possible to merge this change.
In my view, this post does not go into sufficient detail or nuance to warrant any serious discussion, and the sparseness of info mostly implies failure, especially in the browser case.
It _is_ impressive that the browser repo can do _anything at all_, but if there was anything more noteworthy than that, I feel they'd go into more detail than volume metrics like 30K commits, 1M LoC. For instance, the entire capability on display could be constrained to a handful of lines that delegate to other libs.
And, it "is possible" to merge any change that avoids regressions, but the majority of our craft asks the question "Is it possible to merge _the next_ change? And the next, and the 100th?"
If they merge the MR they're walking the walk.
If they present more analysis of the browser it's worth the talk (not that useful a test if they didn't scrutinize it beyond "it renders")
Until then, it's a mountain of inscrutable agent output that manages to compile, and that contains an execution pathway which can screenshot apple.com by some undiscovered mechanism.

by Snuggly73

0 subcomment

And there is the thing about the cost. The blog post says that they've spent trillions (plural!) of tokens on that experiment.
Looking at OAI API pricing, 5.2 Codex is $14 per 1 million output tokens. Which makes a cool $14M for 1 trillion tokens (multiplied by whatever the plural is). For something that "kind of works".
It's a nice ad for OAI and Anysphere, but maybe next time - just donate the money to a browser team?
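For anyone who wants to redo that arithmetic with different numbers, here is a minimal sketch. The function name is made up for illustration, and it assumes every token is billed at the quoted output price. That is a simplification: the post doesn't give the input/output/cached split, and input tokens are usually priced differently.

```rust
/// Estimated spend in US dollars for `tokens` tokens, assuming (as a
/// simplification) that all of them are billed at one flat rate of
/// `usd_per_million` dollars per 1 million tokens.
fn output_cost_usd(tokens: u64, usd_per_million: f64) -> f64 {
    (tokens as f64 / 1_000_000.0) * usd_per_million
}
```

`output_cost_usd(1_000_000_000_000, 14.0)` comes out at 14,000,000.0, i.e. the $14 million per trillion figure above, and it scales linearly with whatever the plural turns out to be.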

by tehsauce

5 subcomments

I was excited to try it out so I downloaded the repo and ran the build. However there were 100+ compilation errors. So I checked the commit history on github and saw that for at least several pages back all recent commits had failed in the CI. It was not clear which commit I should pick to get the semi-working version advertised.
I started looking in the Cargo.toml to at least get an idea how the project was constructed. I saw there that, rather than being built from scratch as the post seemed to imply, almost every core component was simply pulled in from an open source library: quickjs engine, wgpu graphics, winit windowing & input, egui for ui, html parsing, the list goes on. On twitter their CEO explicitly stated that it uses a "custom js vm", which seemed particularly misleading / untrue to me.
Integrating all of these existing components is still super impressive for these models to do autonomously, so I'm just at a loss how to feel when it does something impressive but they then feel the need to misrepresent so much. I guess I just have a lot less respect and trust for the cursor leadership, but maybe a little relief knowing that soon I may just generate my own custom cursor!

by torginus

0 subcomment

Personally, what I don't like about this, now that I think about it, is that they didn't scale up gradually. Let's say there's a ladder of complexity in software, starting at a simple React CRUD app, going on to something more complex, such as a Paint clone, then to something even more complex, like a file manager, and ending up at one of the most complex pieces of software ever made, a web browser.
I'd want to see some system that 100%s the first task (saturation), does a great job on the next, makes a valiant effort on the third, then finally makes something promising but as yet unusable on the last.
This way we could see that scaling up difficulty results in a gradual decline in quality, and could have a decent measurement of where we are at and where we are going.

by jphelan

2 subcomments

This looks like extremely brittle code to my eyes. Look at https://github.com/wilsonzlin/fastrender/blob/main/crates/fa...

What is `FrameState::render_placeholder`?

```
impl FrameState {
  pub fn render_placeholder(&self, frame_id: FrameId) -> Result<FrameBuffer, String> {
    let (width, height) = self.viewport_css;
    // Buffer size is width * height * 4 bytes (RGBA8), with overflow checks.
    let len = (width as usize)
      .checked_mul(height as usize)
      .and_then(|px| px.checked_mul(4))
      .ok_or_else(|| "viewport size overflow".to_string())?;

    if len > MAX_FRAME_BYTES {
      return Err(format!(
        "requested frame buffer too large: {width}x{height} => {len} bytes"
      ));
    }

    // Deterministic per-frame fill color to help catch cross-talk in tests/debugging.
    let id = frame_id.0;
    let url_hash = match self.navigation.as_ref() {
      Some(IframeNavigation::Url(url)) => Self::url_hash(url),
      Some(IframeNavigation::AboutBlank) => Self::url_hash("about:blank"),
      Some(IframeNavigation::Srcdoc { content_hash }) => {
        let folded = (*content_hash as u32) ^ ((*content_hash >> 32) as u32);
        Self::url_hash("about:srcdoc") ^ folded
      }
      None => 0,
    };
    let r = (id as u8) ^ (url_hash as u8);
    let g = ((id >> 8) as u8) ^ ((url_hash >> 8) as u8);
    let b = ((id >> 16) as u8) ^ ((url_hash >> 16) as u8);
    let a = 0xFF;

    // Fill every pixel with the same solid color.
    let mut rgba8 = vec![0u8; len];
    for px in rgba8.chunks_exact_mut(4) {
      px[0] = r;
      px[1] = g;
      px[2] = b;
      px[3] = a;
    }

    Ok(FrameBuffer {
      width,
      height,
      rgba8,
    })
  }
}
```

What is it doing in these diffs?

https://github.com/wilsonzlin/fastrender/commit/f4a0974594e3...

I'd be really curious to see the amount of work/rework over time, and the token/time cost for each additional actual completed test case.
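One way that "cost per additional completed test case" metric could be computed, as a rough sketch: it assumes you have periodic checkpoints of cumulative token spend and passing-test counts, which the post does not publish. `Checkpoint` and `marginal_tokens_per_test` are hypothetical names, not anything from the repo.

```rust
/// One hypothetical checkpoint of a long agent run.
struct Checkpoint {
    cumulative_tokens: u64,
    tests_passing: u32,
}

/// Tokens spent per newly passing test between consecutive checkpoints.
/// An interval yields `None` when no additional test passed (pure rework,
/// or a regression in the passing count).
fn marginal_tokens_per_test(run: &[Checkpoint]) -> Vec<Option<f64>> {
    run.windows(2)
        .map(|w| {
            let spent = w[1].cumulative_tokens.saturating_sub(w[0].cumulative_tokens);
            match w[1].tests_passing.checked_sub(w[0].tests_passing) {
                Some(gained) if gained > 0 => Some(spent as f64 / gained as f64),
                _ => None,
            }
        })
        .collect()
}
```

A rising series of `Some(..)` values, or long stretches of `None`, would be the "rework" signal: tokens going in without more tests passing.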

by mk599

2 subcomments

Define "from scratch" in "building a web browser from scratch". This thing has over 100 crates as dependencies... To implement CSS layout, it uses Taffy, a crate used by existing browser implementations...

by Snuggly73

1 subcomments

The only thing that I got to actually run on WSL2 was the "Excel" (couldn't get anything to actually compile on Mac or Windows).
It's a broken mess that probably implements 0.00001% of Excel. And it's 1.2M LOC.
With codebases developed in this way, either they need to figure out how agents are going to maintain them (in which case SWE as we know it is dead; it will be limited to those who can spend trillions of tokens), or they are going to remain weird demos.

by logicallee

0 subcomment

At the same time they were doing this, I also iterated on an AI-built web browser with around 2,000 lines of code. I was heavily in the loop for it, it didn't run autonomously. You can see the current version of the source code here:
https://taonexus.com/publicfiles/jan2026/172toy-browser.py.t... (turn the sound down, it's a bit loud if you interact with the built-in Tetris clone.)
You can run it after installing the packages, "pip install requests pillow urllib3 numpy simpleaudio"
I livestreamed the latest version here 2 weeks ago, it's a ten minute video:
https://www.youtube.com/watch?v=4xdIMmrLMLo&t=45s
I'm posting from that web browser. As an easter egg, mine has a cool Tetris clone (called Pentrix) based on pieces with 5 segments, the button for this is at the upper-right.
If you have any feature suggestions for what you want in a browser, please make them here:
https://pollunit.com/polls/ahysed74t8gaktvqno100g

by physicsguy

2 subcomments

I have been trying Claude Code a lot this week. Two projects:
* A small statically generated Hugo website but with some clever linking/taxonomy stuff. This was a fairly self-contained project that is now 'finished' but wouldn't have taken me more than a few days to code up from scratch.
* A scientific simulation package, to try and do a clean refresh of an existing one which I can point at for implementation details but which has some technical problems I would like to reduce/remove.
Claude Code absolutely smashed the first one - no issues at all. With the second, no matter what I tried, it just made lots of mistakes, even when I just told it to copy the problematic parts and transpose them into the new structure. It basically got to a point where it wasn't correct and it didn't seem to be able to get out of a bit of a 'doom loop' and required manual intervention, no matter how much prompting and hints I gave it.

by torginus

1 subcomments

I'm kinda surprised how negative and skeptical everyone is here.
It kinda blows my mind that this is possible, to build a browser engine that approximates a somewhat working website renderer.
Even if we take the most pessimistic interpretation of events (heavy human steering, relies on existing libraries, sloppy code quality in places, not all versions compile etc)

by nl

1 subcomments

Remember when 3D printers meant the death of factories? Everyone would just print what they wanted at home.
I'm very bullish on LLMs building software, but this doesn't mean the death of software products any more than 3D printers meant the death of factories.

by danieloj

0 subcomment

I'm not sure "building a web browser" is such a great test for an LLM. It helps confirm that they can handle large codebases. But the actual logic in the browser engine will be based very heavily on Chromium/Firefox etc.

by jphoward

4 subcomments

For the browser it built, the context needed to cover the entire project is obviously huge. They mention loads of parallel agents in the blog post, so I guess each agent is given a module to work on, and some tests? And then a 'manager' agent plugs this in without reading the code? I can't see how else you could do this, even with ChatGPT 5.2/Gemini 3. In retrospect it seems an obvious approach and akin to how humans work in teams, but it's still interesting.

by tired_and_awake

5 subcomments

The moment all code is interacted with through agents I cease to care about code quality. The only thing that matters is the quality of the product, cost of maintenance etc., exactly the things we measure software development orgs against. It could be handy to have these projects deployed to demonstrate their utility and efficacy? Looking at PRs of agents feels wrong-headed, like who cares if agents' code is hard to read if agents are managing the code base?

by navinsylvester

1 subcomments

all these focus on long running agents without focussing on core restructure is baffling. the immediate need is to break down complex tasks into smaller ones and single shot them with some amount of parallelism. imo - we need an opinionated system but with human in the middle and then think about dreamy next steps. we need to focus on groundedness first instead of worrying about agent conjuring something from thin air. the decision to leap frog into automated long running agents is quite baffling.
boys are trying to single shot a browser when a moderate complex task can derail a repo. there’s no good amount of info which might be deliberate but from what i can pick, their value add was “distributed computing and organisational design” but that too they simplified. i agree that simplicity is always the first option but flat filesystem structure without standards will not work. period.

by luhego

0 subcomment

> We initially built an integrator role for quality control and conflict resolution, but found it created more bottlenecks than it solved
Of course it creates bottlenecks, since code quality takes time and people don’t get it right on the first try when the changes are complex. I could also be faster if I pushed directly to prod!
Don’t get me wrong. I use these tools, and I can see the productivity gains. But I also believe the only way to achieve the results they show is to sacrifice quality, because no software engineer can review the changes at the same speed the agent generates code. They may solve that problem, or maybe the industry will change so only output and LOC matter, but until then I will keep cursing the agent until I get the result I want.

by matthewfcarlson

1 subcomments

It’s fascinating that many of the issues they faced I’ve seen in human software engineering teams.
Things like integration creating bottlenecks or a lack of consistent top-down direction leading to small risk-averse changes instead of bold redesigns. All things I’ve seen before.

by thesurlydev

0 subcomment

Pretty cool and related to another path of work I'm following from Steve Yegge: https://medium.com/@steve-yegge/welcome-to-gas-town-4f25ee16...

by WOTERMEON

0 subcomment

Weird twist: the hiring call at the end, for a company that says
> Our mission is to automate coding

by foota

1 subcomments

Slightly off topic, but they want to move from Solid to React? Isn't that the reverse of the newest trend? Would be interesting to know more.

by measurablefunc

2 subcomments

All of these things have readily available analogues on the web which means they are more than likely just laundering open source code & claiming victory.

by throwaway63467

0 subcomment

I’m running Opus 4.5, which is arguably their best model. While it’s really good for a lot of work, it always introduces subtle errors or inconsistencies when left unsupervised, as prompts are never good enough to remove all ambiguity for complex asks. So I can’t imagine what it will do to a code base when left alone with it for days or weeks.

by mdswanson

1 subcomments

Over the past year or so, I've built my own system of agents that behaves almost exactly like this. I can describe what I'd like built before I go to bed and have a fantastic foundation in place by the next day. For simpler projects, they'll be complete. Because of the reviews, the code continually improves until the agents are satisfied. I'm impressed every time.

by ora-600

0 subcomment

I would love to know the cost of building this browser. I think that multi-agent orchestration systems will probably be the theme for systems this year.
I think the north-star metric for a multi-agent orchestrator system would be: how much did it cost to get this done? How much better could we have done? Should we have used a cheaper model for doing a trivial task and an expensive one to monitor it?

by laszlojamf

0 subcomment

They mention billions of tokens, but I'm left wondering how much this experiment actually cost them...

by kilroy123

0 subcomment

My test for whether we've created an AGI-like AI? Build a Linux kernel from scratch that can actually run a full OS on your computer.
But, if I'm being fair, a full working browser from scratch is just as good.

by reactordev

0 subcomment

The planner worker architecture works well for me. About 3 layers is the sweet spot. From prompt -> plan -> task division -> workers.
Sometimes workers will task other workers and act as a planner if the task is more complex.
It’s a good setup but it’s nothing like Claude Code.
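A toy sketch of that prompt -> plan -> task division -> workers shape, for readers who haven't seen it. Everything here is hypothetical: `Task`, `plan`, `divide` and `run_workers` are made-up names, the "planner" just splits the prompt into lines, and each "worker" is a thread that formats a string where a real setup would call a model. Workers tasking other workers is not modeled.

```rust
use std::sync::mpsc;
use std::thread;

/// A unit of work produced by the task-division step.
#[derive(Debug, Clone, PartialEq)]
struct Task {
    id: usize,
    description: String,
}

/// Prompt -> plan. Stand-in planner: each non-empty line becomes one step.
fn plan(prompt: &str) -> Vec<String> {
    prompt
        .lines()
        .map(str::trim)
        .filter(|line| !line.is_empty())
        .map(String::from)
        .collect()
}

/// Plan -> tasks. Here every plan step becomes exactly one task.
fn divide(steps: &[String]) -> Vec<Task> {
    steps
        .iter()
        .enumerate()
        .map(|(id, step)| Task { id, description: step.clone() })
        .collect()
}

/// Tasks -> workers. One thread per task; results come back over a channel
/// and are sorted by task id so the output order is deterministic.
fn run_workers(tasks: Vec<Task>) -> Vec<(usize, String)> {
    let (tx, rx) = mpsc::channel();
    let handles: Vec<_> = tasks
        .into_iter()
        .map(|task| {
            let tx = tx.clone();
            thread::spawn(move || {
                // Stand-in for real agent work.
                let result = format!("done: {}", task.description);
                tx.send((task.id, result)).expect("receiver should be alive");
            })
        })
        .collect();
    drop(tx); // only the workers' clones keep the channel open now
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    let mut results: Vec<(usize, String)> = rx.into_iter().collect();
    results.sort_by_key(|(id, _)| *id);
    results
}
```

Chaining them as `run_workers(divide(&plan(prompt)))` gives the three layers in one line; the interesting engineering is in what replaces the stand-ins.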

by mccoyb

1 subcomments

Supposing agents and their organization improve, it seems like we’re approaching a point where the cost of a piece of software will be driven down to the cost of running the hardware, and the cost of the tokens required to replicate it.
The tokens were “expensive” from the minds of humans …

by Havoc

0 subcomment

> long running
I really dislike this as a measure. An LLM on CPU is also long running cause it’s slow.
I get what it’s meant to convey but time is such a terrible measure of anything if tk/s isn’t static
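To put numbers on that, a tiny sketch: the work done in a run is roughly wall-clock time multiplied by throughput, so the same "one hour" can mean very different amounts of output. The function name and the example rates are made up for illustration.

```rust
/// Tokens generated over a run, given wall-clock seconds and an average
/// throughput in tokens per second (both supplied by the caller).
fn tokens_generated(wall_clock_secs: f64, tokens_per_sec: f64) -> f64 {
    wall_clock_secs * tokens_per_sec
}
```

An hour at a hypothetical 5 tk/s is `tokens_generated(3600.0, 5.0)` = 18,000 tokens, while an hour at 100 tk/s is 360,000, so "ran for an hour" alone says little.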

by foota

1 subcomments

I've always liked the idea of intelligence in the autonomous ships of the Revelation Space universe. Little agents reporting to progressively more intelligent and higher level ones.

by sashank_1509

2 subcomments

Can a browser expert please go through the code the agent wrote (skim it), and let us know how it is? Is it comparable to Ladybird or Servo? Can it ever reach that capability soon?

by darioush

0 subcomment

I find it interesting that this line of adventure quickly led to locking problems.

by tgtweak

0 subcomment

Is it too much to expect companies to share some of this in the open vs just the results?

by dist-epoch

1 subcomments

So, who is going to compile the browser and post the binaries so we can check it out? (in a sandbox/VM obviously)

by sidgarimella

0 subcomment

Very cool. Seems long running AI Agents are the new monuments.

by jamesnorden

0 subcomment

Is the code not even compiling a feature or...

by george_atom

0 subcomment

Reviewing all this code is the issue.

by ramon156

0 subcomment

> A long-running agent made video rendering 25x faster with an efficient Rust version.
Which is not an optimization. This is coming from a Rust dev; rewriting it in Rust is not the optimization.
Also, I do not believe they actually reviewed the SolidJS->React PR. This PR is incredibly unrealistic and should've been done with either stacked PRs or incremental non-breaking changes.
None of this feels organic, can we stop pretending it is?
To continue the pessimistic tone, none of the writing went in-depth. I did not gain any knowledge, just a marketing post.

by cawksuwcka

1 subcomments

would really appreciate some elaboration as they gloss over the most important part in my mind. why can’t one agent just do it. that’s what ai seems to be - an amalgamation of all our knowledge. why split it back up into separate tentacles. i think focus should be on letting it envelop the problem like a fog and swallow it whole, instead of molesting it independently at touch points and reporting back to … the brain? it’s pretty ridiculous actually. just mimicking ourselves yet again.

by gaigalas

0 subcomment

There's a clear conflict between SKILLS, tools and multi-tasking.
I think "intra-context" tooling is already dead. It's too narrow.
It's all "extra-context" now: how one instruments for multiple agents, at multiple times, handling things.
Personally, I think the best tool in this realm will come from open source, and be agnostic (many agents from many places interacting), in order to leverage differences between subtle provider qualities (speed, price and so on).
Building a browser is an interesting and expensive experiment. How much did it cost?

by dinkm

0 subcomment

“Arthur looked up. ‘Ford,’ he said, ‘there’s an infinite number of monkeys outside who want to talk to us about this script for Hamlet they’ve worked out.’”