FRESH

Hacker News

Home

Opus 4.5 is not the normal AI agent experience that I have had thus far

845 points by tbassetto

by OldGreenYodaGPT

32 subcomments

Most software engineers are seriously sleeping on how good LLM agents are right now, especially something like Claude Code.
Once you’ve got Claude Code set up, you can point it at your codebase, have it learn your conventions, pull in best practices, and refine everything until it’s basically operating like a super-powered teammate. The real unlock is building a solid set of reusable “skills” plus a few agents for the stuff you do all the time.
For example, we have a custom UI library, and Claude Code has a skill that explains exactly how to use it. Same for how we write Storybooks, how we structure APIs, and basically how we want everything done in our repo. So when it generates code, it already matches our patterns and standards out of the box.
We also had Claude Code create a bunch of ESLint automation, including custom ESLint rules and lint checks that catch and auto-handle a lot of stuff before it even hits review.
Then we take it further: we have a deep code review agent Claude Code runs after changes are made. And when a PR goes up, we have another Claude Code agent that does a full PR review, following a detailed markdown checklist we’ve written for it.
On top of that, we’ve got like five other Claude Code GitHub workflow agents that run on a schedule. One of them reads all commits from the last month and makes sure docs are still aligned. Another checks for gaps in end-to-end coverage. Stuff like that. A ton of maintenance and quality work is just… automated. It runs ridiculously smoothly.
We even use Claude Code for ticket triage. It reads the ticket, digs into the codebase, and leaves a comment with what it thinks should be done. So when an engineer picks it up, they’re basically starting halfway through already.
There is so much low-hanging fruit here that it honestly blows my mind people aren’t all over it. 2026 is going to be a wake-up call.
(used voice to text then had claude reword, I am lazy and not gonna hand write it all for yall sorry!)
Edit: made an example repo for ya
https://github.com/ChrisWiles/claude-code-showcase

by mcv

8 subcomments

Opus 4.5 ate through my Copilot quota last month, and it's already halfway through it for this month. I've used it a lot, for really complex code.
And my conclusion is: it's still not as smart as a good human programmer. It frequently got stuck, went down wrong paths, ignored what I told it to do to do something wrong, or even repeat a previous mistake I had to correct.
Yet in other ways, it's unbelievably good. I can give it a directory full of code to analyze, and it can tell me it's an implementation of Kozo Sugiyama's dagre graph layout algorithm, and immediately identify the file with the error. That's unbelievably impressive. Unfortunately it can't fix the error. The error was one of the many errors it made during previous sessions.
So my verdict is that it's great for code analysis, and it's fantastic for injecting some book knowledge on complex topics into your programming, but it can't tackle those complex problems by itself.
Yesterday and today I was upgrading a bunch of unit tests because of a dependency upgrade, and while it was occasionally very helpful, it also regularly got stuck. I got a lot more done than usual in the same time, but I do wonder if it wasn't too much. Wasn't there an easier way to do this? I didn't look for it, because every step of the way, Opus's solution seemed obvious and easy, and I had no idea how deep a pit it was getting me into. I should have been more critical of the direction it was pointing to.

by multisport

37 subcomments

What bothers me about posts like this is: mid-level engineers are not tasked with atomic, greenfield projects. If all an engineer did all day was build apps from scratch, with no expectation that others may come along and extend, build on top of, or depend on, then sure, Opus 4.5 could replace them. The hard thing about engineering is not "building a thing that works", its building it the right way, in an easily understood way, in a way that's easily extensible.
No doubt I could give Opus 4.5 "build be a XYZ app" and it will do well. But day to day, when I ask it "build me this feature" it uses strange abstractions, and often requires several attempts on my part to do it in the way I consider "right". Any non-technical person might read that and go "if it works it works" but any reasonable engineer will know that thats not enough.

by s-macke

5 subcomments

Opus 4.5 has become really capable.
Not in terms of knowledge. That was already phenomenal. But in its ability to act independently: to make decisions, collaborate with me to solve problems, ask follow-up questions, write plans and actually execute them.
You have to experience it yourself on your own real problems and over the course of days or weeks.
Every coding problem I was able to define clearly enough within the limits of the context window, the chatbot could solve and these weren’t easy. It wasn’t just about writing and testing code. It also involved reverse engineering and cracking encoding-related problems. The most impressive part was how actively it worked on problems in a tight feedback loop.
In the traditional sense, I haven’t really coded privately at all in recent weeks. Instead, I’ve been guiding and directing, having it write specifications, and then refining and improving them.
Curious how this will perform in complex, large production environments.

by YesBox

7 subcomments

I've noticed a huge drop in negative comments on HN when discussing LLMs in the last 1-2 months.
All the LLM coded projects I've seen shared so far[1] have been tech toys though. I've watched things pop up on my twitter feed (usually games related), then quietly go off air before reaching a gold release (I manually keep up to date with what I've found, so it's not the algorithm).
I find this all very interesting: LLMs dont change the fundamental drives needed to build successful products. I feel like I'm observing the TikTokification of software development. I dont know why people aren't finishing. Maybe they stop when the "real work" kicks in. Or maybe they hit the limits of what LLMs can do (so far). Maybe they jump to the next idea to keep chasing the rush.
Acquiring context requires real work, and I dont see a way forward to automating that away. And to be clear, context is human needs; i.e. the reasons why someone will use your product. In the game development world, it's very difficult to overstate how much work needs to be done to create a smooth, enjoyable experience for the player.
While anyone may be able to create a suite of apps in a weekend, I think very few of them will have the patience and time to maintain them (just like software development before LLMs! i.e. Linux, open source software, etc.).
[1] yes, selection bias. There are A LOT of AI devs just marketing their LLMs. Also it's DEFINITELY too early to be certain. Take everything Im saying with a one pound grain of salt.

by rcarmo

4 subcomments

I had a similar set of experiences with GPT 5.x over the holiday break, across somewhat more disparate domains: https://taoofmac.com/space/notes/2025/12/31/1830
I hacked together a Swift tool to replace a Python automation I had, merged an ARM JIT engine into a 68k emulator, and even got a very decent start on a synth project I’ve been meaning to do for years.
What has become immensely apparent to me is that even gpt-5-mini can create decent Go CLI apps provided you write down a coherent spec and review the code as if it was a peer’s pull request (the VS Code base prompts and tooling steer even dumb models through a pretty decent workflow).
GPT 5.2 and the codex variants are, to me, every bit as good as Opus but without the groveling and emojis - I can ask it to build an entire CI workflow and it does it in pretty much one shot if I give it the steps I want.
So for me at least this model generation is a huge force multiplier (but I’ve always been the type to plan before coding and reason out most of the details before I start, so it might be a matter of method).

by tripledry

3 subcomments

Putting the performance aside for now as I just started trying out Opus 4.5, can't say too much yet, I don't hype or hate AI as of now, it's simply useful.
Time will tell what happens, but if programming becomes "prompt engineering", I'm planning on quitting my job and pivoting to something else. It's nice to get stuff working fast, but AI just sucks the joy out of building for me.
Trying to not feel the pressure/anxiety from this, but every time a new model drops there is this tiny moment where I think "Is it actually different this time?"

by hollandburke

2 subcomments

Author of the post here.
I appreciate the spirited debate and I agree with most of it - on both sides. It's a strange place to be where I think both arguments for and against this case make perfect sense. All I have to go on then is my personal experience, which is the only objective thing I've got. This entire profession feels stochastic these days.
A few points of clarification...
1. I don't speak for anyone but myself. I'm wrong at least half the time so you've been warned.
2. I didn't use any fancy workflows to build these things. Just used dictation to talk to GitHub Copilot in VS Code. There is a custom agent prompt toward the end of the post I used, but it's mostly to coerce Opus 4.5 into using subagents and context7 - the only MCP I used. There is no plan, implement - nothing like that. On occasion I would have it generate a plan or summary, but no fancy prompt needed to do that - just ask for it. The agent harness in VS Code for Opus 4.5 is remarkably good.
3. When I say AI is going to replace developers, I mean that in the sense that it will do what we are doing now. It already is for me. That said, I think there's a strong case that we will have more devs - not less. Think about it - if anyone with solid systems knowledge can build anything, the only way you can ship more differentiating features than me is to build more of them. That is going to take more people, not more agents. Agents can only scale as far as the humans who manage them.
New account because now you know who I am :)

by simonw

7 subcomments

Opus 4.5 really is something else. I've been having a ton of fun throwing absurdly difficult problems at it recently and it keeps on surprising me.
A JavaScript interpreter written in Python? How about a WebAssembly runtime in Python? How about porting BurntSushi's absurdly great Rust optimized string search routines to C and making them faster?
And these are mostly just casual experiments, often run from my phone!

by honeycrispy

7 subcomments

A couple weeks ago I had Opus 4.5 go over my project and improve anything it could find. It "worked" but the architecture decisions it made were baffling, and had many, many bugs. I had to rewrite half of the code. I'm not an AI hater, I love AI for tests, finding bugs, and small chores. Opus is great for specific, targeted tasks. But don't ask it to do any general architecture, because you'll be soon to regret it.

by jedberg

2 subcomments

I had an app I wanted for over a decade. I even wrote a prototype 10 years ago. It was fine but wasn't good enough to use, so I didn't use it.
This weekend I explained to Claude what I wanted the app to do, and then gave it the crappy code I wrote 10 years ago as a starting point.
It made the app exactly as I described it the first time. From there, now that I had a working app that I liked, I iterated a few times to add new features. Only once did it not get it correct, and I had to tell it what I thought the problem was (that it made the viewport too small). And after that it was working again.
I did in 30 minutes with Claude what I had try to do in a few hours previously.
Where it got stuck however was when I asked it to convert it to a screensaver for the Mac. It just had no idea what to do. But that was Claude on the web, not Claude Code. I'm going to try it with CC and see if I can get it.
I also did the same thing with a Chrome plugin for Gmail. Something I've wanted for nearly 20 years, and could never figure out how to do (basically sort by sender). I got Opus 4.5 to make me a plugin to do it and it only took a few iterations.
I look forward to finally getting all those small apps and plugins I've wanted forever.

by Herring

1 subcomments

Me and Opus have a lot in common. We both hit our weekly limit on Monday at 10am.

by poisonborz

4 subcomments

I see these posts left and right but no one mentions the _actual_ thing developers are hired for, responsibility. You could use whatever tools to aid coding already, even copy paste from StackOverflow or take whole boilerplate projects from Github already. No AI will take responsibility for code or fix a burning issue that arises because of it. The amount of "responsibility takers" also increases linearly with the size of the codebase / amount of projects.

by maciejzj

6 subcomments

I've been on a small adventure of posting more actively on HN since the release of Gemini 3, trying to stir debate around the more “societal” aspects of what's going on with AI.
Regardless of how much you value Cloud Code technically, there is no denying that it has/will have huge impact. If technology knowledge and development are commoditised and distributed via subscription, huge societal changes are going to happen. Image what will happen to Ireland if Accenture dissolves, or what will happen to the millions of Indians when IT outsourcing becomes economically irrelevant. Will Seattle become new Detroit after Microsoft automates Windows maintenance? What about the hairdressers, cooks, lawyers, etc. who provided services for IT labourers/companies in California?
Lot of people here (especially Anthropic-adjacent) like to extrapolate the trends and draw conclusions up to the point when they say that white-collar labourers will not be needed anymore. I would like these people to have courage to take this one step further and connect this resolution with the housing crisis, loneliness epidemic, college debts, and job market crisis for people under 30.
It feels like we are diving head first into societal crisis of unparalleled scale and the people behind the steering wheel are excited to push the accelerator pedal even more.

by tannedNerd

11 subcomments

The problem with this is none of this is production quality. You haven’t done edge case testing for user mistakes, a security audit, or even just maintainability.
Yes opus 4.5 seems great but most of the time it tries to vastly over complicate a solution. Its answer will be 10x harder to maintain and debug than the simpler solution a human would have created by thinking about the constraints of keeping code working.

by soulofmischief

4 subcomments

Opus 4.5 is currently helping me write a novel, comprehensive and highly performant programming language with all of the things I've ever wanted, done in exactly my opinionated way.
This project would have taken me years of specialization and research to do right. Opus's strength has been the ability to both speak broadly and also drill down into low-level implementations.
I can express an intent, and have some discussion back and forth around various possible designs and implementations to achieve my goals, and then I can be preparing for other tasks while Opus works in the background. I ask Opus to loop me in any time there are decisions to be made, and I ask it to clearly explain things to me.
Contrary to losing skills, I feel that I have rapidly gained a lot of knowledge about low-level systems programming. It feels like pair programming with an agentic model has finally become viable.
I will be clear though, it takes the steady hand of an experience and attentive senior developer + product designer to understand how to maintain constraints on the system that allow the codebase to grow in a way that is maintainable on the long-term. This is especially important, because the larger the codebase is, the harder it becomes for agentic models to reason holistically about large-scale changes or how new features should properly integrate into the system.
If left to its own devices, Opus 4.5 will delete things, change specification, shirk responsibilities in lieu of hacky band-aids, etc. You need to know the stack well so that you can assist with debugging and reasoning about code quality and organization. It is not a panacea. But it's ground-breaking. This is going to be my most productive year in my life.
On the flip side though, things are going to change extremely fast once large-scale, profitable infrastructure becomes easily replicable, and spinning up a targeted phishing campaign takes five seconds and a walk around the park. And our workforce will probably start shrinking permanently over the next few years if progress does not hit a wall.
Among other things, I do predict we will see a resurgence of smol web communities now that independent web development is becoming much more accessible again, closer to how it when I first got into it back in the early 2000's.

by ChrisbyMe

3 subcomments

Mm this is my experience as well, but I'm not particularly worried about software engineering a whole.
If anything this example shows that these cli tools give regular devs much higher leverage.
There's a lot of software labor that is like, go to the lowest cost country, hire some mediocre people there and then hire some US guy to manage them.
That's the biggest target of this stuff, because now that US guy can just get equal or hight code in both quality and output without the coordination cost.
But unless we get to the point where you can do what I call "hypercode" I don't think we'll see SWEs as a whole category die.
Just like we don't understand assembly but still need technical skills when things go wrong, there's always value in low level technical skills.

by kachapopopow

2 subcomments

It's also the feeling I have, opus is not a ground-breaking model by any means.
However, Opus 4.5 is incredible when you give it everything it needs, a direction, what you have versus what you want and it will make it work, really, it will work. The code might me ugly, undesirable, would only work for that one condition, but with futher prompting you can evolve it and produce something that you can be proud of.
Opus is only as good as the user and the tools the user gives to it. Hmm, that's starting to sound kind-of... human...

by losvedir

3 subcomments

I'm kind of surprised how many people are okay with deploying code that hasn't been audited.
I read If Anyone Builds It Everyone Dies over the break. The basic premise was that we can't "align" AI so when we turn it loose in an agent loop what it produces isn't necessarily what we want. It may be on the surface, to appease us and pass a cursory inspection, but it could embed other stuff according to other goals.
On the whole, I found it a little silly and implausible, but I'm second guessing parts of that response now that I'm seeing more people (this post, the Gas Town thing on the front page earlier) go all-in on vibe coding. There is likely to be a large body of running software out there that will be created by agents and never inspected by humans.
I think a more plausible failure mode in the near future (next year or two) is something more like a "worm". Someone building an agent with the explicit instructions to try to replicate itself. Opus 4.5 and GPT 5.2 are good enough that in an agent loop they could pretty thoroughly investigate any system they land on, and try to use a few ways to propagate their agent wrapper.

by ben-gy

4 subcomments

I second this article - I built twelve iOS/Mac apps in two weeks with Opus 4.5 - four of them are already in the App Store - I’m a Rails Engineer and never had the time to learn Swift but man does Opus 4.5 make that not even matter - it even handles entitlements, logo & splash screen generation, refactors to remove dead code, edge case assent and hardening, Multiplatform app design, and more - I’m yet to run into a use case it can’t handle for most general use cases - that said, I have found some common mistakes it makes (by common I mean almost every time); puts iOS line list line items in buttons making them blue when they should not be, doesn’t set defaults for new data structure variables which crashes the app when changing the data structure after the fact, design consistent after the first shot (minor things like white background instead of grey background like all the other screens already, etc) - the one thing that i know it cant do well (and no other model that I know of can do this well either) is ASTM bi-directional communications (we work with pathology analysers that use this 1995 frame-based communication standard), even when you load it up with the spec and supporting docs - I suspect this is due to a dirty of available codebases that tackle this problem due to its niche and generally proprietary nature…

by haolez

1 subcomments

I have a different concern: the SOTA products are expensive and get dumbed down on busy times. My personal strategy has been to be a late follower, where I adopt new AI tools when the competition has caught up with the previous SOTA, and now there are many tools that are cost effective and equally good.
Can't wait for when the competition catches up with Claude Code, especially the open source/weights Chinese alternatives :)

by LatencyKills

1 subcomments

I really wonder what means for software moving forward. In the last few months I've used Claude Code to build personalized versions of Superwhisper (voice-to-text), CleanShot X (screenshot and image markup), and TextSniper (image to text). The only cost was some time and my $20/month subscription.

by qnleigh

3 subcomments

So much of the conversation is around these models replacing software engineers. But the use cases described in the article sound like pretty compelling business opportunities; if the custom apps he built for his wife's business have been useful, probably there are lots of businesses that would pay for the service he just provided his wife. Small, custom apps can be made way more cheaply now, so Jeven's paradox says that demand should go up. I think it will.
I would love to hear from some freelance programmers how LLMs have changed their work in the last two years.

by dzonga

1 subcomments

what strikes me about these posts is they praise models for apps | utilities commonly found on GitHub.
ie well known paths based on training data.
what's never posted is someone building something that solves a real problem in the real world - that deals with messy data | interfaces.
I like a.i to do the common routine tasks that I don't like to do like apply tailwind styles but being renter and faking productivity that's not it

by artdigital

0 subcomment

I switched my subscription from Claude to ChatGPT around 5.0 when SOTA was Sonnet 4.5 and found GPT-5-high (and now 5.2-high) so incredibly good, I could never imagine Opus is on its level. I give gpt-5.2-high a spec, it works for 20 minutes and the result is almost perfect and tested. I very rarely have to make changes.
It never duplicates code, implements something again and leaves the old code around, breaks my convention, hallucinates, or tells me it’s done when the code doesn’t even compile, which sonnet 4.5 and Opus 4.1 did all the time
I’m wondering if this had changed with Opus 4.5 since so many people are raving about it now. What’s your experience?
Claude - fast, to the point but maybe only 85% - 90% there and needs closer observation while it works
GPT-x-high (or xhigh) - you tell it what to do, it will work slowly but precise and the solution is exactly what you want. 98% there, needs no supervision

by Workaccount2

1 subcomments

Anthropic dropped out of the general "AGI" race and seems to be purely focused on coding, maybe racing to get the first "automated machine learning programmer". Whatever the case, it seems to be paying (coding) dividends to just be focusing on coding.

by llmslave2

0 subcomment

I see Anthropics marketing campaign is out in full force today ahead of their IPO.

by mr_o47

0 subcomment

Reading this blog post makes me wanna rethink my career, Opus 4.5 is really good I was recently working on solving my own problem by developing a software solution and let me tell you it was really good at it,
If I had done the same thing Pre LLM era it would have taken me months

by mpalmer

1 subcomments

After reading that article, I see at least one thing that Opus 4.5 is clearly not going to change.
There is no fixed truth regarding what an "app" is, does, or looks like. Let alone the device it runs on or the technology it uses.
But to an LLM, there are only fixed truths (and in my experience, only three or four possible families of design for an application).
Opus 4.5 produces correct code more often, but when the human at the keyboard is trying to avoid making any engineering decisions, the code will continue to be boring.

by bennydog224

0 subcomment

Don't want to discredit Opus at all, it's easy at directed tasks but it's not the silver bullet yet.
It is best in its class, but trips up frequently with complicated engineering tasks involving dynamic variables. Think: Browser page loading, designing for a system where it will "forget" to account for race conditions, etc.
Still, this gets me very excited for the next generation of models from Anthropic for heavy tasks.

by mattfrommars

4 subcomments

I’ve been saying this a countless time, LLM are great to build toy and experimental projects.
I’m not shaming but I personally need to know if my sentiment is correct or not or I just don’t know how to use LLMs
Can vibe coder gurus create operating system from scratch that competes with Linux and make it generate code that basically isn’t Linux since LLM are trained on said the source code …
Also all this on $20 plan. Free and self host solution will be best

by headcanon

0 subcomment

Yep, I literally built this last night with Opus 4.5 after my wife and I challenged each other to a typing competition. I gave it direction and feedback but it wrote all the actual code. Wasn't a one shot (maybe 3-4 shot) but didn't really have to think about it all that hard.
https://chronick.github.io/typing-arena/
With another more substantial personal project (Eurorack module firmware, almost ready to release), I set up Claude Code to act as a design assistant, where I'd give it feedback on current implementation, and it would go through several rounds of design/review/design/review until I honed it down. It had several good ideas that I wouldn't have thought of otherwise (or at least would have taken me much longer to do).
Really excited to do some other projects after this one is done.

by PaulHoule

0 subcomment

I'll argue many of his cases are things that are straightforward except for the boilerplate that surrounds them which are often emotionally difficult or prone to rabbit holes.
Like that first one where he writes a right-click handler, off the top of my head I have no idea how I would do that, I could see it taking a few hours to just set up a dev environment, and I would probably overthink the research. I was working on something where Junie suggested I write a browser extension for Firefox and I was initially intimidated at the thought but it banged out something in just a few minutes that basically worked after the second prompt.
Similarly the Facebook autoposter is completely straightforward to code but it can be so emotionally exhausting to fight with authentication APIs, a big part of the coding agent story isn't just that it saves you time but that they can be strong when you are emotionally weak.
The one which seems the hardest is the one that does the routing and travel time estimation which I'd imagine is calling out to some API or library. I used to work at a place that did sales territory optimization and we had one product that would help work out routes for sales and service people who travel from customer to customer and we had a specialist code that stuff in C++ and he had a very different viewpoint than me, he was good at what he did and could get that kind of code to run fast but I wouldn't have trusted him to even look at applications code.

by nphardon

0 subcomment

Sonnet 4.5 did it for me. Cant imagine coding without it now, and if you look at my comments from three months ago, you'll see I'm eating crow now. I easily hit >10x productivity with Sonnet 4.5 and Opus. I use Opus for my industry C and math work and Sonnet 4.5 for my swiftui side project.
I think the gap between Sonnet 4.5 and Opus is pretty small, compared to the absolute chasm between like gpt-4.1, grok, etc. vs Sonnet.

by fractallyte

1 subcomments

That final line: "Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5"
Yeah, GRAMMAR
For all the wonderment of the article, tripping up on a penultimate word that was supposedly checked by AI suddenly calls into question everything that went before...

by oncallthrow

3 subcomments

Yeah Opus 4.5 is a massive step change in my experience. I feel like I’m working with a peer, not a junior I’m having to direct. I can give it highly ambiguous and poorly specified tasks and it… just does it.
I will note that my experience varies slightly by language though. I’ve found it’s not as good at typescript.

by ycombiredd

1 subcomments

I can't quite figure out what sort of irony the blurb at the bottom of the post is. (I'm unsure if it was intentional snark, a human typo, or an inadvertent demonstration of Haiku not being well suited for spelling and grammar checks), but either way I got a chuckle:
> Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5

by egorfine

1 subcomments

So I decided to try the revered hands-off approach and have Claude Code create me a small tool in JS for *.dylib bundle consolidation on macOS.
I have used AskUserQuestionTool to complete my initial spec. And then Opus 4.5 created the tool according to that extensive and detailed spec.
It appeared to work out of the box.
Boy how horrific was the code. Unnecessary recursions, unused variables, data structures being built with no usage, deep branch nesting and weird code that is hard to understand because of how illogical it is.
And yes, it was broken on many levels and did not and could not do the job properly.
I then had to rewrite the tool from scratch and overall I have definitely spent more time spec'ing and understanding Claude code than if I have just written this tool from scratch initially.
Then I tried again for a small tool I needed to run codesign in parallel: https://github.com/egorFiNE/codesign-parallel
Same thing. Same outcome, had to rewrite.

by throw10920

1 subcomments

Does anyone have a boring, multi-hour-long coding session with an agent that they've recorded and put on Vimeo or something?
As many other commentators have said, individual results vary extremely widely. I'd love to be able to look at the footage of either someone who claims a 10x productivity increase, or someone who claims no productivity increase, to see what's happening.

by funnyfoobar

0 subcomment

I was not expecting a couple of new apps being built, when the premise of the blog post talks about replacing "mid level engineers"
the thing about being an engineer at commercial capacity is "maintaining/enhancing an existing program/software system that has been developed over years by multiple people(including those who already left) and do it in a way that does not cause any outages/bugs/break existing functionality.
while the blog post mentions about the ability of using AI to generate new applications, but it does not talk about maintaining one over a longer period of time. for that, you would need real users, real constraints, and real feature requests which preferably pay you so you can priortize them.
I would love to see such blog posts where for example, a PM is able to add features for a period of one month without breaking the production, but it would be a very costly experiment.

by takinola

1 subcomments

I guess the best analogy I can think of is the transition from writing assembly language and the introduction of compilers. Now, (almost) no one knows, or cares, what comes out of the compiler. We just assume it is optimized and that it represents the source code faithfully. Seems like code might go that way too and people will focus on the right prompts and can simply assume the code will be correct.

by raldi

1 subcomments

Despite the abuse of quotation marks in the screenshot at the top of this link, Dario Amodei did not in fact say those words or any other words with the same meaning.

by shnpln

0 subcomment

I have used Claude Code for a variety of hobby projects. I am truly astounded at its capabilities.
If you tell it to use linters and other kinds of code analysis tools it takes it to the next level. Ruff for Python or Clippy for Rust for example. The LLM makes so much code so fast and then passes it through these tools and actually understands what the tools say and it goes and makes the changes. I have created a whole tool chain that I put in a pre commit text file in my repos and tell the LLM something like "Look in this text file and use every tool you see listed to improve code quality".
That being said, I doubt it can turn a non-dev into a dev still, it just makes competent devs way better still.
I still need to be able to understand what it is doing and what the tools are for to even have a chance to give it the guardrails it should follow.

by smusamashah

0 subcomment

What about Sonnet 4.5? I used both Opus and Sonnet on Claude.ai and found sonnet much better at following instructions and doing exactly what was asked.
(it was for single html/js PWA to measure and track heart rate)
Opus seems to go less deep, does it's own things, do not follow instructions exactly EVEN IF I WROTE ALL CAPS. With Sonnet 4.5 I can understand everything author is saying. May be Opus is optimised for Claude code and Sonnet works best on Web.

by mcpar-land

0 subcomment

It is very funny to start your article off with a bunch of breathless headlines about agents replacing human coders by the end of 2025, none of which happened, then the rest of the article is "okay but this time for real, an agent really WILL replace human coders."

by lagniappe

1 subcomments

Title is: "Opus 4.5 is going to change everything"

by avidphantasm

1 subcomments

Cool. Please check back in with us after they’ve raised the price 50x and you can no longer build anything because you are alienated from your tools.

0 subcomment

by delduca

2 subcomments

I agree, it wrote an entire NES emulator for me.
https://news.ycombinator.com/item?id=46443767

by emsign

1 subcomments

The worst part about this is that you can't know anymore whether the software you trustingly install on your hardware is clean or if it was coded by a misaligned coding model with a secret goal that it has hidden from its prompt engineer and from you.
This could pretty much be the beginning of the end of everything, if misaligned models wanted to they could install killswitches everywhere. And you can't trust security updates either so you are even more vulnerable to external exploits.
It's really scary, I fear the future, it's going to be so bad. It's best to not touch AI at all and stay hidden from it as long as possible to survive the catastrophe or not be a helping part of it. Don't turn your devices into a node of a clandestine bot net that is only waiting to conspire against us.

by Kon5ole

0 subcomment

I agree with the OP that I can get LLM's to do things now that I wouldn't even attempt a year ago, but I feel it has more to do with my own experience using LLM's (and the surrounding tools) than the actual models themselves.
I use copilot and change models often, and haven't really noticed any major differences between them, except some of the newer ones are very slow.
I generally feel the smaller and faster ones are more useful since they will let me discover problems with my prompt or context faster.
Maybe I'm simply not using LLM's in a way that lets the superiority of newer models reveal itself properly, but there is a huge financial incentive for LLM makers to pretend that their model has game-changing "special sauce" even if it doesn't.

by MarsIronPI

0 subcomment

It worries me that the best models, the ones that can one-shot apps and such, are all non-free and owned by companies who can't be trusted to have end-users' best interests at heart. It would be greatly reassuring to see a self-hostable model that can compete with Opus 4.5 and Gemini 3 at such coding tasks.

by oldnewthing

0 subcomment

Claude Code is very good; good enough that I upgraded to the Max plan this week. However, it has a long way to go. It's great at one-shotting (with iterations) most ideas. However, it doesn't do as well when the task is complicated in an existing codebase. This weekend I migrated the backend for the SaaS I am building from Python to .NET Core. It did the migration but completely missed the conventions that the frontend was using to call the backend. While the converion itself went OK, every user journey was broken. I am still manually testing every code path and feeding in the errors to get Claude to fix it. My instructions were fairly comprehensive but Claude still missed most of it. My fault that I didn't generate tests first, but after this migration that's my first task.

by daxfohl

0 subcomment

This resonates with my experience in codex 5.2, at least directionally. I'm pretty persnickety about code itself, so I'm not to the point where I'll just let it rip. But in the last month or two things have gone from "I'll ask on the web interface and maybe copy some code into the project", to trusting the agent and getting a reasonable starting point about half the time.
> because models like to write code WAY more than they like to delete it
Yeah, this is the big one. I haven't figured it out either. New or changing requirements are almost always implemented a flurry of if/else branches all over the place, rather than taking the time for a step back and a reimagining of a cohesive integration of old and new. I've had occasional luck asking for this explicitly, but far more frequently they'll respond with recommendations that are far more mechanical, e.g. "you could extract a function for these two lines of code that you repeat twice", not architectural, in nature. (I still find pasting a bunch of files into the chat interface and iterating on refinements conversationally to be faster and produce better results).
That said, I'm convinced now that it'll get there sooner or later. At that point, I really don't know what purpose SWEs will serve. For a while we might serve as go-betweens between the coding agent and PMs, but LLMs are already way better at translating from tech jargon to human, so I can't imagine it would be long before product starts bypassing us and talking directly to the agents, who (err, which) can respond with various design alternatives, pros and cons of each, identify all the dependencies, possible compatibility concerns, alignment with future direction, migration time, compute cost, user education and adoption tracking, etc, all in real time in fluent PM-ese. IDK what value I add to that equation.
For the last year or so I figured we'd probably hit a wall before AI got to that point, but over the last month or so, I'm convinced it's only a matter of time.

by noisy_boy

0 subcomment

All great until the code in production pushed by Opus 314.15 breaks and Opus 602.21, despite it's many tries, can't fix it and ends it with "I apologize". That's when you need a developer who can be told "fix it". But what if all the developers then are "Opus 600+ certified" ai-native and are completely incapable of working without it's assistance? World powers decide to open the forbidden vault in the Arctic and despite many warnings on the chamber, decide to raise the foul-mouthed programmer-demon called Torvalds....

0 subcomment

by prokopton

0 subcomment

I asked Claude’s opinion and it disagreed. :)
Claude’s response:
The article’s central tension is real - Burke went from skeptic to believer by building four increasingly complex apps in rapid succession using Opus 4.5. But his evidence also reveals the limits of that belief.
Notice what he actually built: Windows utilities, a screen recorder, and two Firebase-backed CRUD apps for his wife’s business. These are real applications solving real problems, but they’re also the kinds of projects where you can throw away the code if something goes wrong. When he says “I don’t know how the code works” and “I’m maybe 80% confident these applications are bulletproof,” he’s admitting the core problem with the “AI replaces developers” narrative.
That 80% confidence matters. In your Splink work, you’re the sole frontend developer - you can’t deploy code you’re 80% confident about. You need to understand the implications of your architectural decisions, know where the edge cases are, and maintain the system when requirements change. Burke’s building throwaway prototypes for his wife’s yard sign business. You’re building production software that other people depend on.
His “LLM-first code” philosophy is interesting but backwards. He’s optimizing for AI regeneration rather than human maintenance because he assumes the AI will always be there to fix problems. But AI can’t tell you why a decision was made six months ago when business requirements shift. It can’t explain the constraints that led to a particular architecture. And it definitely can’t navigate political and organizational context when stakeholders disagree about priorities.
The Firebase examples are telling - he keeps emphasizing how well Opus knows the Firebase CLI, as if that proves general capability. But Firebase is extremely well-documented, widely-discussed training data. Try that same experiment with your company’s internal API or a niche library with poor documentation. The model won’t be nearly as capable.
What Burke actually demonstrated is that Opus 4.5 is an excellent pair programmer for prototyping with well-known tools. That’s legitimately valuable. But “pair programmer for prototyping” isn’t the same as “replacing developers.” It’s augmenting someone who already knows how to build software and can evaluate whether the generated code is good.
The most revealing line is at the end: “Just make sure you know where your API keys are.” He’s nervous about security because he doesn’t understand the code. That nervousness is appropriate - it’s the signal that tells you when you’ve crossed from useful tool into dangerous territory.

by Staross

0 subcomment

I gave it a try, I asked to do a reddit like forum and it did pretty good but damn I quickly hit the daily limit of the $20 pro account, and it took 10% of the monthly just to do the setup and some basics. I knew LLM were expensive to run but I've never felt it directly. Even if the code is good it's kinda expensive for what you get.
Ho it was also quite funny it used the exact same color as hackernews and a similar layout.

by thedangler

0 subcomment

I've only started but I mostly use Claude Code for building out code that has been done a million times. So its good at setting up a project to get all the boiler plate crap out of the way.
When you need to build out specific feature or logic, it can fail hard. And the best is when you have something working, and it fixes something else and deletes the old code that was working, just in a different spot.

0 subcomment

by brushfoot

0 subcomment

I pivoted into integrations in 2022. My day-to-day now is mostly in learning the undocumented quirks of other systems. I turn those into requirements, which I feed to the model du jour via GitHub Copilot Agents. Copilot creates PRs for me to review. I'd say it gets them right the vast majority of the time now.
Example: One of my customers (which I got by Reddit posts, cold calls, having a website, and eventually word of mouth) wanted to do something novel with a vendor in my niche. AI doesn't know how to build it because there's no documentation for the interfaces we needed to use.

by theappsecguy

1 subcomments

It’s incredibly tiring to see this narrative peddled every damn day. I use opus 4.5 every day. It’s not much different than any previous models, still does dumb things all the time.

by kace91

0 subcomment

>Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5
Either it wasn’t that good, or the author failed in the one phrase they didn’t proofread.
(No judgement meant, it’s just funny).

by squirrellous

0 subcomment

For some reason Opus 4.5 is blowing up recently after having been released for weeks. I guess because holidays are over? Active agent users should have discovered this for a while.

by p0w3n3d

0 subcomment

```
  Disclaimer: This post was written by a human and edited for spelling, grammer by Haiku 4.5
```
I recently am finishing the reading of Mistborn series, so please do not read further unless you want a spoiler.
```
  SPOILER
```
There is a suspicion that mists can change written text.
```
  END OF SPOILER
```
So how can we be sure that Haiku didn't change the text in favour of AI then?

by vl

1 subcomments

Honestly, I don’t understand universal praise for Opus 4.5. It’s good, but really not better than other agents.
Just today:
Opus 4.5 Extended Thinking designed psql schema for “stream updates after snapshot” with bugs.
Grok Heavy gave correct solution without explanations.
ChatGPT 5.2 Pro gave correct solution and also explained why simpler way wouldn’t work.

by yardie

1 subcomments

These are very simple utilities. I expect AI to be able to build them easily. Maybe in a few years it will be able to write a complete photo editor or CAD application from first principles.

by dudeinhawaii

0 subcomment

LLMS like Opus, Gemini 3, and GPT-5.2/5.1-Codex-max, are phenomenal for coding and have only very recently crossed that gap between being "eh" and being quite fantastic to let operate on their own agentically. The major trade-off being a fairly expensive cost. I ran up $200 per provider after running through 'pro' tier limits during a single week of hacking over the holidays.
Unfortunately, it's still surprisingly easy for these models to fall into really stupid maintainability traps.
For instance today, Opus adds a feature to the code that needs access to a db. It fails because the db (sqlite) is not local to the executable at runtime. Its solution is to create this 100 line function to resolve a relative path and deal with errors and variations.
I hit ESC and say "... just accept a flag for --localdb <file>". It responds with "oh, that's a much cleaner implementation. Good idea!". It then implements my approach and deletes all the hacks it had scattered about.
This... is why LLMs are still not Senior engineers. They do plainly stupid things. They're still absurdly powerful and helpful, but if you want maintainable code you really have to pay attention.
Another common failure is when context is polluted.
I asked Opus to implement a feature by looking up the spec. It looked up the wrong spec (a v2 api instead of a v3) -- I had only indicated "latest spec". It then did the classic LLM circular troubleshooting as we went in 4 loops trying to figure out why calculations were failing.
I killed the session, asked a fresh instance to "figure out why the calculation was failing" and it found it straight away. The previous instance would have gone in circles for eternity because its worldview had been polluted by assumptions made -- that could not be shaken.
This is a second way in which LLMs are rigid and robotic in their thinking and approach -- taking the wrong way even when directed not to. Further reading on 'debugging decay': https://arxiv.org/abs/2506.18403
All this said, the number of failure scenarios gets ever smaller. We've gone from "problem and hallucination every other code block" to "problem every 200-1000 code blocks".
They're now in the sweet spot of acting as a massive accelerator. If you're not using them, you'll simply deliver slower.

by weatherlite

0 subcomment

The main issue in this discussion is the word "replace" . People will come up with a bunch of examples where humans are still needed in SWE and can't be fully replaced, that is true. I think claiming that 100% of engineers would be replaced in 2026 is ridiculous. But how about downsizing? Yeah that's quite probable.

by manmal

1 subcomments

IMO codex produces working code slowly, while Opus produces superficially working code quickly. I like using Opus to drive codex sessions and checking its output. Clawdbot is really good at that but a long running Claude Code session with codex as sub agents should work well also.
The above is for vibe coding; for taking the wheel, I can only use Opus because I suck at prompting codex (it needs very specific instructions), and codex is also way too slow for pair programming.

by minimaxir

0 subcomment

See also: a post from a couple days ago which came to the same conclusion that Opus 4.5 is an inflection point above Sonnet 4.5 despite that conclusion being counterintuitive: https://news.ycombinator.com/item?id=46495539
It's hard to say if Opus 4.5 itself will change everything given the cost/latency issues, but now that all the labs will have very good synthetic agentic data thanks to Opus 4.5, I will be very interested to see what the LLMs release this year will be able to do. A Sonnet 4.7 that can do agentic coding as well as Opus 4.5 but at Sonnet's speed/price would be the real gamechanger: with Claude Code on the $20/mo plan, you can barely do more than one or two prompts with Opus 4.5 per session.

by elendee

0 subcomment

the author asks one interesting question and then glides right by it. If the agents only need their own code, what should that code look like? If all their learning has come from old human code, how will that change in the future as the ecosystem fills up with agent code?

by karmasimida

0 subcomment

As impressive as Opus 4.5 is, it still fails in one situation that it assumes 0-index while the component it supposes to work with assume 1-index. It has access to the said information on disk, but just forgets to look into.
Opus 4.5 is incredible, it is the GPT-4 moment for coding because how honest and noticeable the capacity increase is. But still, it has blind spots just like human.

by lifetimerubyist

0 subcomment

Opus helped me optimized a wonky SQL query today from 4s to 5min. Truly something that only a super intelligence is capable of.

by jcmfernandes

1 subcomments

To the author: you wrote those apps. Not like you used to, but you wrote them.
IMO, our jobs are safe. It's our ways of working that are changing. Rapidly.

by hsn915

0 subcomment

I had a similar feeling expressed in the title regarding ChatGPT 5.2
I haven't tried it for coding. I'm just talking about regular chatting.
It's doing something different from prior models. It seems like it can maintain structural coherence even for very long chats.
Where as prior models felt like System 1 thinking, ChatGPT5.2 appears like it exhibits System 2 thinking.

by killerstorm

0 subcomment

Weird title. Obviously, early AI agents were clumsy, and we should expect more mature performance in future.
Leopold Aschenbrenner was talking about "unhobbling" as an ongoing process. That's what we are seeing here. Not unexpected

by SergeAx

0 subcomment

This article is much better than hundred of similar articles "AI will change software engineering" because it have links to actual products created with said "AI". I can't say they are impressive, but definitely so for laypeople.

by jackdoe

0 subcomment

most of software engineering was rational, now it is becoming empirical
it is quite strange, you have to make it write the code in a way it can reason about it without it reading it, you also have to feel the code without reading all of it. like a blind man feeling the shape of an object; Shape from Darkness
you can ask opus to make a car, it will give you a car, then you ask it for navigation; no problem, it uses google maps works perfect
then you ask it to improve the breaks, and it will give internet to the tires and the break pedal, and the pedal will send a signal via ipv6 to the tires which will enable a very well designed local breaking system, why not, we already have internet for google maps.
i think the new software engineering is 10 times harder than the old one :)

by Snuggly73

0 subcomment

Ok, if its almighty, then why is not the benchmarks at 100%? If you look at the individual issues, those are somewhat small and trivial changes in existing codebases.
https://swe-rebench.com/
(note that if you look at individual slices, Opus is getting often outperformed by Sonnet).

by infinitezest

1 subcomments

The question I keep asking myself is "how feasible will any of this be when the VC money runs out?" Right now tokens are crazy cheap. Will the continue to be?

by nsb1

1 subcomments

A lot of the complaints about these tools seems to revolve around their current lack of ability to innovate for greenfield or overly complex tasks. I would agree with this assessment in their current state, but this sentiment of "I will only use AI coding tools when they can do 100% of my job" seems short-sighted.
The fact of the matter, in my experience, is that most of the day to day software tasks done by an individual developer are not greenfield, complex tasks. They're boring data-slinging or protocol wrangling. This sort of thing has been done a thousand times by developers everywhere, and frankly there's really no need to do the vast majority of this work again when the AIs have all been trained on this very data.
I have had great success using AIs as vast collections of lego blocks. I don't "vibe code", I "lego code", telling the AI the general shape and letting it assemble the pieces. Does it build garbage sometimes? Sure, but who doesn't from time to time? I'm experienced enough notice the garbage smell and take corrective action or toss it and try again. Could there be strange crevices in a lego-coded application that the AI doesn't quite have a piece for? Absolutely! Write that bit yourself and then get on with your day.
If the only thing you use these tools for is doing simple grunt-work tasks, they're still useful, and dismissing them is, in my opinion, a mistake.

by sachahjkl

0 subcomment

Yowza, AIs excel at writing low performance CRUD apps, REVOLUTION INCOMING

by Sxubas

0 subcomment

Just an open thought, what if most improvement we are seeing is not mostly due to LLM improvements but to context management and better prompting?
Ofc the reality is a mix of both, but really curious on what contributes more.
Probably just using cursor with old models (eww) can yield a quick response.

by waynenilsen

2 subcomments

Once you get your setup bulletproof such that you can have multiple agents running at the same time that can run unit tests and close their own loops things get even faster. However you accomplish that. Not as easy as it sounds mostly (and absurdly) due to port collision. E2E testing with playwright is another leap.

by MORPHOICES

1 subcomments

Title: Ask HN: How do you evaluate claims of “this model changes everything” in practice?
The release of every big model seems to carry the identical vibe: finally, this one crossed the line. The greatest programmer. The end of workflows and their meaning.
I’ve learned to slow myself down and ask a different question. What has changed in my day-to-day work after two weeks?
I currently make use of a filter with roughness.
Did it really solve a problem, or did it just make easy parts easier?
Has it lessened the number of choices or has it created new ones?
Have my review responsibilities decreased or increased?
Some things feel revolutionary on day one and then quietly fade into something that’s nice to have. Others barely wow, but stay around. ~
For those who've experienced a couple of cycles.
What indicators suggest that an upcoming release will be significant?
When do you alter your workflow, after how long?

by jdthedisciple

1 subcomments

To those of you who use it: How much does Claude Code cost you a month on avg?
I only use VS Code with Copilot subscription ($10) and already get quite a lot out of it.
My experience is that Claude Code really drains your pocket extremely fast.

by alex1138

0 subcomment

Are the LLMs in any way trained semantically or by hooks that you can plug in, say, Python docs? And if a new version of Python then gets released then the training data changes, etc

by _se

1 subcomments

It's always fun to ask Opus what it thinks about articles like this. Here's what I got with no history or system prompt:
https://burkeholland.github.io/posts/opus-4-5-change-everyth...
Read this article and ultrathink critically about it. Provide your perspective.
The article makes a strong experiential case for improved velocity in AI-assisted development but contains several argumentation weaknesses and conflations worth examining.
The projects described are legitimately non-trivial: Firebase backend integration, Facebook OAuth, iOS apps in Swift (a language the author doesn't know), GitHub Actions pipelines, scheduled cloud functions. Getting these working in hours rather than weeks represents a real capability shift. The author is honest about his limitations and uncertainties, particularly the security concerns.
Where the argument breaks down:
1. "Replace developers" vs "dramatically augment developers" The author's own workflow contradicts the headline claim. He's still:
Making architectural decisions (choosing Firebase) Handling errors Opus couldn't see (XAML via Visual Studio) Writing custom prompts to shape output quality Manually auditing security Making product and UX decisions
This is developer work. The tool changed; the role didn't disappear.
2. The 80% security confidence undermines his thesis He admits he's shipping apps with "80% confidence" in security and calls it "too damn low." This is the crux: the AI accelerated production but didn't replace the judgment required to responsibly ship production software. The velocity gain exposed a competence gap rather than closing it.
3. Sample bias in project selection
All examples are:
Greenfield (no existing codebase) Single developer Personal/family use Standard patterns with excellent documentation (Firebase, SwiftUI, React Native) No regulatory, compliance, or scale requirements No team collaboration or handoff considerations
These constraints define a specific problem space where AI excels. Extrapolating to "developers are replaceable" ignores the majority of professional software work.
4. "Code doesn't need human readability" is underbaked His argument is circular: "Why optimize for human readability when the AI is doing all the work?" But:
His 80% security confidence exists because he can't read the code He had to use external tools (VS) when Opus couldn't diagnose errors What happens when context windows are exceeded and the LLM loses track? Model behavior changes between versions; human-readable code is version-agnostic
The custom prompt he shares actually encodes many good engineering practices (minimal coupling, explicit state, linear control flow) that benefit LLMs and humans. The "no comments needed" claim conflates what's optimal for LLM regeneration with what's optimal for debugging production issues at 3am. What's actually being demonstrated
The honest version of this article would be: Opus 4.5 dramatically compresses the gap between "can write code" and "can ship a personal app" for a specific class of greenfield projects. That's genuinely transformative for hobbyists, indie developers, and people solving their own problems. But that's different from "replacing developers." The article demonstrates a power tool; power tools don't eliminate tradespeople.

by scotty79

0 subcomment

> And if it ran into errors, it would try and build using the dotnet CLI, read the errors and iterate until fixed.
Antigravity with Gemini 3 pro from Google has the same capability.

by orthoxerox

1 subcomments

What's the best coding agent you can run locally? How far behind Opus 4.5 is it?

by arielweisberg

0 subcomment

I agree. Claude Code went from being slower than doing it myself to being on average faster, but also far less exhausting so I can do more things in general while it works.

by on_the_train

1 subcomments

Oh another run of new small apps. Why not unleash this oh so powerful tools not on a jira ticket written two years ago, targeting 3 different repos in an old legacy moloch, like actual work?
It's always just the "Fibonacci" equivalent

by _pdp_

0 subcomment

YEP
Things are changing. Now everyone can build bespoke apps. Are these apps pushing the limits of technology? No! But they work for the very narrow and specific domain they where designed. And yes they do not scale and have as much bugs as your personal shell scripts. But they work.
But let's not compare these with something more advance - at least not yet. Maybe by end of this year?
We switched from Sonnet 4.5 to Opus 4.5 as our default coding agent recently and we pay the price for the switch (3x the cost) but as the OP said, it is quite frankly amazing. It does a pretty good job, especially, especially when your code and project is structured in a such a way that it helps the agent perform well. Anthropic released an entire video on the subject recently which aligns with my own observations as well.
Where it fails hard is in the more subtle areas of the code, like good design, best practices, good taste, dry, etc. We often need to prompt it to refactor things as the quick solution it decided to do is not in our best interest for the long run. It often ends in deep investigations about things which are trivially obvious. It is overfitted to use unix tools in their pure form as it fail to remember (even with prompting) that it should run `pnpm test:unit` instead `npx jest` - it gets it wrong every time.
But when it works - it is wonderful.
I think we are at the point where we are close to self-improving software and I don't mean this lightly.
It turns out the unix philosophy runs deep. We are right now working on ways to give our agents more shells and we are frankly a few iterations there. I am not sure what to expect after this but I think whatever it is, it will be interesting to see.

by adithyassekhar

0 subcomment

I like writing code

by Papazsazsa

2 subcomments

"Opus 4.5 feels to me like"
The article is fine opinion but at what point are we going to either:
a) establish benchmarks that make sense and are reliable, or
b) stop with the hypecycle stuff?

by chris_st

0 subcomment

I've found asking GPT-5.2 High to review Opus 4.5's code to be really productive. They find different things.

by ironbound

0 subcomment

This is great can't wait for the future when our VC ideas can become unicorns, without CEO's & Founders..

by Fischgericht

1 subcomments

People should finally understand that LLMs are a lossy database of PAST knowledge. Yes, if you throw a task at it that has been done tons of times before, it works. Which is not a surprise, because it takes minutes to Google and index multiple full implementations of "Tool that allows you to right-click on an image to convert it". Without LLM you could do the same: Just copy&paste the implementation of that from Microsoft Powertoys, for example.
What LLMs will NOT do however, is write or invent SOMETHING KNEW.
And parts of our industry still are about that: Writing Software that has NOT been written before.
If you hire junior developers to re-invent the wheels: Sure, you do not need them anymore.
But sooner or later you will run out of people who know how to invent NEW things.
So: This is one more of those posts that completely miss the point. "Oh wow, if I look up on Wikipedia how to make pancakes I suddenly can make and have pancakes!!!1". That always was possible. Yes, you now can even get an LLM to create you a pancake-machine. Great.
Most of the artists and designers I am friends with have lost their jobs by now. In a couple of years you will notice the LLMs no longer have new styles to copy from.
I am all for the "remix culture". But don't claim to be an original artist, if you are just doing a remix. And LLM source code output are remixes, not original art.

by jcadam

0 subcomment

Yea, my issue with Opus 4.5 is it's the first model that's good enough that I'm starting to feel myself slip into laziness. I catch myself reviewing its output less rigorously than I had with previous AI coding assistants.
As a side project / experiment, I designed a language spec and am using (mostly) Opus 4.5 to write a transpiler (language transpiles to C) for it. Parser was no problem (I used s-expressions for a reason). The type checker and transpiler itself have been a slog - I think I'm finding the limits of Opus :D. It particularly struggles with multi-module support. Though, some of this is probably mistakes made by me while playing architect and iterating with Claude - I haven't written a compiler since my senior year compiler design course 20+ years ago. Someone who does this for a living would probably have an easier time of it.
But for the CRUD stuff my day job has me doing? Pffttt... it's great.

0 subcomment

by thallukrish

0 subcomment

When complexity increases, you end up handholding them in pieces.

by ggregoire

1 subcomments

I'm always surprised to never see any comments in those discussions from people who just like coding, learning, solving problems… I mean, it's amazing that LLMs can build an image converter or whatever you dream of, in a language you don't know, in a field you are not familiar with, in 1 hour, for 30 cents… I'm sure your boss and shareholders love it. But where is the fun in that? For me it kills any interest in doing what I'm doing. I'm lucky enough to work in a place where using LLMs is not mandatory (yet), I don't know how people can make it through the day just writing prompts and reviewing AI slop.

0 subcomment

by vladsh

5 subcomments

It’s a bit strange how anecdotes have become acceptable fuel for 1000 comment technical debates.
I’ve always liked the quote that sufficiently advanced tech looks like magic, but its mistake to assume that things that look like magic also share other properties of magic. They don’t.
Software engineering spans over several distinct skills: forming logical plans, encoding them in machine executable form(coding), making them readable and expandable by other humans(to scale engineering), and constantly navigating tradeoffs like performance, maintainability and org constraints as requirements evolve.
LLMs are very good at some of these, especially instruction following within well known methodologies. That’s real progress, and it will be productized sooner than later, having concrete usecases, ROI and clearly defined end user.
Yet, I’d love to see less discussion driven by anecdotes and more discussion about productizing these tools, where they work, usage methodologies, missing tooling, KPIs for specific usecases. And don’t get me started on current evaluation frameworks, they become increasingly irrelevant once models are good enough at instruction following.

by dmarwicke

0 subcomment

this is just optimizing for token windows. flat code = less context. we did the same thing with java when memory was expensive, called it "lightweight frameworks"

by exabrial

0 subcomment

What is with all the Claude spam lately on hn?

by DustinBrett

0 subcomment

Post the code open source and run it on prod.

by satisfice

0 subcomment

Doing things for your own use, where you are taking all the risks, is perfectly fine.
As soon as you try to sell it to me, you have a duty of care. You are not meeting that duty of care with this ignorant and reckless way of working.

by bluelightning2k

0 subcomment

The harness here was Claude Code?

by rubzah

0 subcomment

Once again. It is not greenfield projects most of us want to use AI coding assistance for. It is for an existing project, with a byzantine mess of a codebase, and even worse messes of infrastructure, business requirements, regulations, processes, and God knows what else. It seems impossible to me that AI would ever be useful in these contexts (which, again, are practically all I ever deal with as a professional in software development).

by pyuser583

2 subcomments

Opus 4.5 burns through tokens really fast.

by DGAP

0 subcomment

Time to get a new job.

by overgard

1 subcomments

Ugh, I'm so sick of these "I can use AI to solve an already solved problem, thus programmers aren't relevant." Note the solved problem part. This isn't convincing except to people that want a (bad) argument to depress wages and lay off workers while making the existing seniors take on more and more work. This is overall bad for the industry.

by lawlessone

1 subcomments

Blogspam.

by bigcloud1299

0 subcomment

Oh shit your UI looks exactly 100% like mine.

by danfritz

12 subcomments

Every time I see a post like this on HN I try again and every time I come to the same conclusion. I have never see one agent managing to pull something off that I could instantly ship. It still ends up being very junior code.
I just tried again and ask Opus to add custom video controls around ReactPlayer. I started in Plan mode which looked overal good (used our styling libs, existing components, icons and so on).
I let it execute the plan and behold I have controls on the video, so far so good. I then look at the code and I see multiple issues: Over usage of useEffect for trivial things, storing state in useState which should be computed at run time, failing to correctly display the time / duration of the video and so on...
I ask follow up question like: Hide the controls after 2 seconds and it starts introducing more useEffects and states which all are not needed (granted you need one).
Cherry on the cake, I asked to place the slider at the bottom and the other controls above it, it placed the slider on the top...
So I suck at prompting and will start looking for a gardening job I guess...

by drchiu

0 subcomment

Having used Opus 4.5 for the past 5 weeks, I estimate it codes better than 95% of the people I've ever worked with.
And it writes with more clarity too.
The only people who are complaining about "AI slop" are those whose jobs depend on AI to go away (which it won't).

by skerit

0 subcomment

Ah, another thread filled with people sharing anecdotes about how they asked Claude to one-shot an entire project that would take people weeks if not months.

by vivzkestrel

1 subcomments

- does it understand the difference between eslint 8x and eslint 9.x?
- or biome 1.x and biome 2.x ?
- nah! it never will and that is why it ll never replace mid level engineers, FTFY

by hollowturtle

13 subcomments

I'm tired of constantly debating the same thing again and again. Where are the products? Where is some great performing software all LLM/agent crafted? All I see is software bloatness and decline. Where is Discord that uses just a bunch of hundreds megs of ram? Where is unbloated faster Slack? Where is the Excel killer? Fast mobile apps? Browsers and the web platform improved? Why Cursor team don't use Cursor to get rid of vscode base and code its super duper code editor? I see tons of talking and almost zero products.

by kypro

3 subcomments

It's been interesting watching HN shift in my direction on this in recent weeks...
I had been saying since around summer of this year that coding agents were getting extremely good. The base model improvements were ok, but the agentic coding wrappers were basically game changers if you were using them right. Until recently they still felt very context limited, but the context problem increasingly feels like a solved problem.
I had some arguments on here in the summer about how it was stupid to hire junior devs at this point and how in a few years you probably wouldn't need senior devs for 90% of development tasks either. This was an aggressive prediction 6 months ago, but I think it's way too conservative now.
Today we have people at our company who have never written code building and shipping bespoke products. We've also started hiring people who can simply prove they can build products for us using AI in a single day. These are not software engineers because we are paying them wages no SWEs would accept, but it's still a decent wage for a 20 something year old without any real coding skills but who is interested in building stuff.
This is something I wouldn't have never of expected to be possible 6 months ago. In 6 months we've gone from senior developers writing ~50% of their code with AI, to just a handful of senior developers who now write close to 90% of their code with AI while they support a bunch of non-developers pumping out a steady stream of shippable products and features.
Software engineers and traditional software engineer is genuinely running on borrowed time right now. It's not that there will be no jobs for knowledgable software engineers in the coming years, but companies simply won't need many hotshot SWEs anymore. The companies that are hiring significant numbers of software engineers today simply can not have realised how much things have changed over just the last few months. Apart from the top 1-2% of talent I simply see no good reason to hire a SWE for anything anymore. And honestly outside of niche areas, anyone hand-cracking code today is a dinosaur... A good SWE today should see their job as simply reviewing code and prompting.
If you think that the quality of code LLMs produce today isn't up to scratch you've either not used the latest models and tools or you're using them wrong. That's not to say it's the best code – they still have a tendency to overcomplicate things in my opinion – but it's probably better than the average senior software engineer. And that's really all that matters.
I'm writing this because if you're reading this thinking we're basically still in 2024 with slightly better models and tooling you're just wrong and you're probably not prepared for what's coming.

by yolkedgeek

0 subcomment

I really can't tell if this is satire or not

by kelseyfrog

0 subcomment

Can it pre-emptively write the HN comment where someone says it utterly fails for them but no one else is able to reproduce?

by gogasca

0 subcomment

[dead]

by halfmatthalfcat

0 subcomment

[flagged]

by hannofcart

0 subcomment

To the sceptics still saying that LLMs still can't solve "slime mold pathing algorithm and creating completely new shoe-lacing patterns" (literally a quote from a different comment here), please consider something we've learnt over and over again in history: good enough and cheap will destroy perfect but expensive.
And then cheap and good enough option will eventually get better because that's the one that is more used.
It's how Japanese manufacturing beat Western manufacturing. And how Chinese manufacturing then beat Japanese again.
It's why it's much more likely you are using the Linux kernel and not GNU hurd.
It's how digital cameras left traditional film based cameras in the dust.
Bet on the cheaper and good enough outcome.
Bet against it at your peril.

by nikisil80

6 subcomments

[flagged]