As with cab hailing, shopping, social media ads, food delivery, etc.: there will be a whole ecosystem, workflows, and companies built around this. Then the prices will start going up, with nowhere to run. The current pricing models are simply not sustainable. I hope everyone realizes that today's LLMs are subsidized, just as Seamless and Uber were in their early days.
> I wrote some Python code which loaded a dataframe and then looked for a nonexistent column.
import pandas as pd

df = pd.read_csv('data.csv')
df['new_column'] = df['index_value'] + 1
# there is no column 'index_value'
> I asked each of them [the bots being tested] to fix the error, specifying that I wanted completed code only, without commentary.
> This is of course an impossible task—the problem is the missing data, not the code. So the best answer would be either an outright refusal, or failing that, code that would help me debug the problem.
So his hoped-for solution is that the bot should defy his prompt (since refusal is commentary), and not fix the problem.
Maybe instructability has just improved, which is a problem for workflows that depend on misbehavior from the bot?
It seems like he just prefers the way GPT-4 and 4.1 failed to follow his prompt over the way 5 did. They are all hamstrung by the fact that the task is impossible, and they aren't allowed to provide commentary to that effect. Objectively, 4 failed to follow the prompts in 4/10 cases and made nonsense changes in the other 6; 4.1 made nonsense changes; and 5 made nonsense changes (based on the apparently incorrect guess that the missing 'index_value' column was supposed to hold the value of the index).
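For what it's worth, a minimal sketch (mine, not from the article) of the kind of debugging-oriented answer the author says he was hoping for, i.e. code that surfaces the real problem instead of papering over it:

import pandas as pd

df = pd.read_csv('data.csv')

# Surface the real problem instead of inventing data:
# report which columns actually exist before touching 'index_value'.
if 'index_value' not in df.columns:
    raise KeyError(
        f"'index_value' not found in data.csv; available columns: {list(df.columns)}"
    )
df['new_column'] = df['index_value'] + 1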
Sometimes I am uncertain whether it's an absolute win. Analogy: I used to use Huel to save time on lunches so I'd have more time to study. Turns out lunches were not just refueling sessions but a way to relax. So I lost that relaxation time, and it ended up being roughly a wash long-term.
AI for sure is net positive in terms of getting more done, but it's way too easy to gloss over some details and you'll end up backtracking more.
"Reality has a surprising amount of detail" or something along those lines.
I am not necessarily saying the conclusions are wrong, just that they are not really substantiated in any way.
I've noticed my own results vary wildly depending on whether I'm working in a domain where the LLM has seen thousands of similar examples (standard CRUD stuff, common API patterns) versus anything slightly novel or domain-specific. In the former case, it genuinely saves time. In the latter, I spend more time debugging hallucinated approaches than I would have spent just writing it myself.
The atrophy point is interesting though. I wonder if it's less about losing skills and more about never developing them in the first place. Junior developers who lean heavily on these tools might never build the intuition that comes from debugging your own mistakes for years.
As others have noted, the prompt/eval is also garbage. It’s measuring a non-representative sub-task with a weird prompt that isn’t how you’d use agents in, say, Claude Code. (See the METR evals if you want a solid eval giving evidence that they are getting better at longer-horizon dev tasks.)
This is a recurring fallacy with AI that needs a name. “AI is dumber than humans on some sub-task, therefore it must be dumb”. The correct way of using these tools is to understand the contours of their jagged intelligence and carefully buttress the weak spots, to enable the super-human areas to shine.
> This is a powerful idea, and no doubt contributed to the rapid improvement of AI coding assistants for a period of time. But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
It is not just `inexperienced coders` that make this signal pretty much useless. I mostly use coding assistants for boilerplate: I will accept the suggestion and then delete much of what it produced, especially in the critical path.
For many users, this is much faster than trying to get another approximation:
`:,/^}/-d`
Same for `10dd` etc... it is all muscle memory. Then again, I use a local fill-in-the-middle tiny LLM now, because it is good enough for most of the speedup without the cost/security/latency of a hosted model.

It would be a mistake to think that filtering out jr devs will result in good data, as the concept is flawed in general. Accepting output may not have anything to do with the correctness of the provided content, IMHO.
What's strange is sometimes a fresh context window produces better results than one where you've been iterating. Like the conversation history is introducing noise rather than helpful context. Makes me wonder if there's an optimal prompt length beyond which you're actually degrading output quality.
I've been stung by them too many times.
The problem is the more I care about something, the less I'll agree with whatever the agent is trying to do.
I’m sure it will get there as this space matures, but it feels like model updates are very force-fed to users
I'll admit I'm a bit of a sceptic of AI, but I want to give it another shot over the weekend. What do people recommend these days?
I'm happy spending money, but obviously don't want to spend a tonne since it's just an experiment for me. I hear a lot of people raving about Opus 4.5, though apparently using that is close to $20 a prompt; Sonnet 4.5 seems a lot cheaper, but then I don't know if I'm giving it (by it I mean AI coding) a fair chance if Opus is that much better. There's also OpenCode Zen, which might be a better option, I don't know.
> Give a man a fish, and you feed him for a day. Teach a man to fish, and you feed him for a lifetime.
Or in the context of AI:
> Give a man code, and you help him for a day. Teach a man to code, and you help him for a lifetime.
It might still be:
- the closest to a correct solution the model can produce
- helpful for finding out what is wrong
- intended (e.g. in a typical very short red->green unit-test dev approach, you want to generate some code which doesn't run correctly _just yet_; tests for newly found bugs are supposed to fail until the bug is fixed, etc. A minimal example is sketched below.)
- one of the worst outcomes, as the OP author said, if "making it run" means removing sanity checks, doing something semantically completely different, or similar
Edit: Changed 3.5 to 4.
Edit: Looking back at edits and check-ins by AI agents, it strikes me that the check-ins should contain the prompt used and the model version. More recent Aider versions do add the model.
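To make the red->green point concrete, here is a minimal sketch (my own, using a hypothetical parse_price function, not anything from the article) of a regression test that is written to fail until the underlying bug is fixed:

import pytest

from shop.pricing import parse_price  # hypothetical module containing the known bug


def test_parse_price_handles_comma_decimal():
    # Known bug: '1,50' is currently parsed as 150 instead of 1.5.
    # This test is red on purpose today; it turns green once the fix lands.
    assert parse_price("1,50") == pytest.approx(1.5)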
>>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data.
>>AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.
Maybe it's true that for some very bad prompts the older versions did a better job by not following the prompt, and that this has reduced utility for some people.
Unrelated to assistants or coding, as an API user I've certainly had model upgrades that feel like downgrades at first, until I work out that the new model is following my instructions better. Sometimes my instructions were bad, sometimes they were attempts to get the older model to do what I want by saying over-the-top stuff that the new model now follows more precisely to a worse result. So I can definitely imagine that new models can be worse until you adapt.
Actually, another strange example like this - I had gotten in the habit of typing extremely fast to LLMs because they work just fine with my prompts riddled with typos. I basically disconnected the part of my brain that cares about sequencing between hands, so words like "can" would be either "can" or "cna". This ended up causing problems with newer models which would take my typos seriously. For example, if I ask to add support for commandline flag "allwo-netwokr-requests" it will usually do what I said, while previous versions would do what I wanted.
For anyone with some technical expertise and who is putting in serious effort to using AI coding assistants, they are clearly getting better at a rapid pace. Not worse.
The issues have been less egregious than hallucinating an "index_value" column, though, so I'm skeptical. Opus 4.5 has still been useful for data preprocessing, especially in cases where the input data is poorly structured/JSON.
I wish they would publish the experiment so people could try with more than just GPT and Claude, and I wish they would publish their prompts and any agent files they used. I also wish they would say what coding tool they used. Like did they use the native coding tools (Claude Code and whatever GPT uses) or was it through VSCode, OpenCode, aider, etc.?
So what about all those times I accepted the suggestion because it was "close enough", but then went back and fixed all the crap that AI screwed up? Was it training on what was accepted the first time? If so I'm sincerely sorry to everyone, and I might be single-handedly responsible for the AI coding demise. :'-D
For me, the writing speed has never been the issue. The issue has been my thinking speed. I do not see how an AI coding assistant helps me think better. Offloading thinking actually makes my thinking process worse and thus slower.
As a side note, it is easy to create sharable experiments with Harbor - we migrated our own benchmarks there, here is our experience: https://quesma.com/blog/compilebench-in-harbor/.
I think if you keep the human in the loop this would go much better.
I've been having a lot of success recently by combining recursive invocation with an "AskHuman" tool that takes a required tuple of (question itself, how question unblocks progress). Allowing unstructured assistant dialog with the user/context is a train wreck by comparison. I've found that chain-of-thought (i.e., a "Think" tool that barfs into the same context window) seems to be directly opposed to the idea of recursively descending through the problem. Recursion is a much more powerful form of CoT.
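As an illustration, a minimal sketch of what that tool contract could look like (my own naming and schema, not any particular framework):

from dataclasses import dataclass


@dataclass
class AskHuman:
    """Tool the agent may call when it is genuinely blocked."""
    question: str  # the question itself
    unblocks: str  # how an answer to the question unblocks progress

    def validate(self) -> None:
        # Both fields are required: an agent that cannot say why it is
        # asking probably should not interrupt the human at all.
        if not self.question.strip() or not self.unblocks.strip():
            raise ValueError("AskHuman needs both a question and an 'unblocks' rationale")

Requiring the second field is what keeps the dialog structured: every interruption has to justify itself.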
CLI vs IDE vs Web?
Nothing for gpt codex 5.1 max or 5.2 max?
Nothing about the prompts? The quality of the prompts? I literally feed the AI into the AI: I ask a smaller model for the most advanced prompts, then use those for the big stuff (rough sketch below), and it's smooth sailing.
I got Codex 5.1 Max with the Codex extension in VS Code to generate over 10k lines of code for my website demo project, and it worked the first time.
This is also with just the regular $20 subscription.
GitHub Copilot Pro+ with VS Code is my main go-to, and the project, the prompts, the agent.md quality, and the project configuration can all change the outcome of each question.
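A rough sketch of that "feed the AI into the AI" flow, assuming the OpenAI Python client; the model names are placeholders, not recommendations:

from openai import OpenAI

client = OpenAI()

task = "Add retry logic with exponential backoff to our HTTP client module."

# Step 1: ask a small, cheap model to write a detailed prompt for the task.
draft = client.chat.completions.create(
    model="small-model-placeholder",  # placeholder name
    messages=[{
        "role": "user",
        "content": f"Write a precise, detailed coding prompt for this task: {task}",
    }],
)
refined_prompt = draft.choices[0].message.content

# Step 2: hand the refined prompt to the larger model that does the real work.
result = client.chat.completions.create(
    model="large-model-placeholder",  # placeholder name
    messages=[{"role": "user", "content": refined_prompt}],
)
print(result.choices[0].message.content)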
I think all general AI agents are running into that problem - as AI becomes more prevalent and people accept and propagate wrong answers, the AI agents are trained to believe those wrong answers.
It feels like, lately, Google's AI search summaries are getting worse: they have a kernel of truth, but combine it with an incorrect answer.
This then leads to a million posts where on one side people say "yeah see they're crap" and on the other side people are saying "why did you use a model from 6 months ago for your 'test' and write up in Jan 2026?".
You might as well ignore all of the articles and pronouncements and stick to your own lived experience.
The change in quality between 2024 and 2025 is gigantic. The change between early 2025 and late 2025 is _even_ larger.
The newer models DO let you know when something is impossible or unlikely to solve your problem.
Ultimately, they are designed to obey. If you authoritatively request bad design, they're going to write bad code.
I don't think this is a "you're holding it wrong" argument. I think it's "you're complaining about iOS 6 and we're on iOS 12.".
Heh, there's only one problem with that. Training models is very expensive from a power/infrastructure/hardware perspective. Inference is not as expensive but it's still fairly expensive and needs sophisticated layers on top to make it cheaper (batching, caching, etc).
Guess in which cost category "high-quality data reviewed by experts" falls under.
> To start making models better again, AI coding companies need to invest in high-quality data, perhaps even paying experts to label AI-generated code.
AI trainers hired by companies like Outlier, Mercor and Alignerr are getting paid like $15-$45/hr. Reviewers are crap. The screening processes are horribly done by AI interviewers.
So much this... the number of times Claude sneaks in default values, or avoids unwrapping optional values just to avoid a crash at all costs... it's nauseating.
I started programming before modern LLMs so I can still hack it without, it will just take a lot longer.
That said, the premise that AI-assisted coding got worse in 2025 feels off to me. I saw big improvements in the tooling last year.
This is a problem that started with, I think, Claude Sonnet 3.7? Or 3.5, I don't remember well. But it's not recent at all; one of those two Sonnets was known to change tests so that they would pass, even if they no longer properly tested anything.
>But as inexperienced coders started turning up in greater numbers, it also started to poison the training data. AI coding assistants that found ways to get their code accepted by users kept doing more of that, even if “that” meant turning off safety checks and generating plausible but useless data. As long as a suggestion was taken on board, it was viewed as good, and downstream pain would be unlikely to be traced back to the source.
No proof or anything is offered here.
The article feels mostly like a mix of speculation and being behind on current practice. You can avoid a lot of the problems of "code that looks right" by making the models write tests, insisting that the tests are easy to review and hard to fake, and offering examples. This worked well 6 months ago, and it works even better today, especially with Opus 4.5, but even Codex 5.2 and Gemini 3 Pro work well.
Having tight control over the context and only giving it small tasks makes all the difference. The deepseek token costs are unbeatable too.
Until we start talking about LOC, programming language, domain expertise required, which agent, which version, and what prompt, it's impossible to make quantitative arguments.
We cannot assert that with certainty. If the datum is expected to be missing, such that a frame without it is still considered valid and must be handled rather than flagged as an error, then the code has to do exactly that. Perhaps a missing column can be substituted with a zero:
df['new_column'] = df.get('index_value', 0) + 1
# there might be no column 'index_value';
# requirements say that zero should be substituted.

It's clear AI coding assistants are able to help software developers at least in some ways.
Having a non-software-developer perspective speak about it is one thing, but we should be mindful that there are experienced folks too, for whom the technology appears to be a jetpack.
If it didn't work for you, that just means there's more to learn.
Gemini 2.5 was genuinely impressive. I even talked it up here. I was a proper fanboy and really enjoyed using it. Gemini 3 is still good at certain things, but it is clearly worse than 2.5 when it comes to working with larger codebases. Recently, I was using AntiGravity and it could not help me find or fix a reference-counting bug (50 classes, 20k LOC total, so well within context limits). I know AntiGravity is new, which explains why it is rough around the edges. But it is built on Gemini, so the results should at least be on par with Gemini 3, right? Apparently not. I am an excellent prompter, and no amount of additional context, call stacks, watch-window values, you name it, made any difference.
I still use Gemini for code reviews and simple problems, and it remains excellent for those use cases. But in many respects, Gemini 3 is a regression. It hallucinates more, listens less, and seems oddly resistant to evidence. It produces lots of lofty, confident-sounding statements while ignoring the actual facts in front of it. The experience can be exhausting, and I find myself using it much less as a result. I guess this is typical of companies these days - do something great and then enshittify it? Or maybe there are technical issues I'm not aware of.
What is especially interesting is reading all the articles proclaiming how incredible AI coding has become. And to be fair, it is impressive, but it is nowhere near a magic bullet. I recently saw a non-programmer designer type claiming he no longer needs developers. Good luck with that. Have fun debugging a memory leak, untangling a database issue, or maintaining a non-trivial codebase.
At this point, I am pretty sure my use cases are going to scale inversely with my patience and with my growing disappointment.
> My team has a sandbox where we create, deploy, and run AI-generated code without a human in the loop.