No prior attempt to follow best practices (e.g. deletion protection in production)? Nor manual gating of production changes?
No attempt to review Claude's actions before performing them?
No management of Terraform state file?
No offline backups?
And to top it off, Claude (the supposed expert tool) didn't repeatedly output "Are you insane? No, I'm not working on that." Clearly Claude isn't particularly expert; otherwise, like any principal engineer, it would've refused and suggested sensible steps first.
(If you, dear reader of this comment, are going to defend Claude, first you need to declare whether you view it as just another development tool, or as a replacement for engineers. If the former, then yeah, this is user error and I agree with you - tools have limits and Claude isn't as good as the hyped-up claims - clearly it failed to output the obvious gating questions. If the latter, then you cannot defend Claude's failure to act like a senior engineer in this situation.)
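For the record, the guardrails the parent lists are mostly one-liners in Terraform itself; a minimal sketch for an RDS instance (identifiers invented):

```hcl
resource "aws_db_instance" "prod" {
  identifier     = "prod-db"
  engine         = "postgres"
  instance_class = "db.t3.medium"

  # Cloud-side gate: AWS refuses deletion until this flag is flipped off first.
  deletion_protection = true

  # If the instance is ever destroyed anyway, keep a final snapshot.
  skip_final_snapshot       = false
  final_snapshot_identifier = "prod-db-final"
}
```

None of this depends on the model being smart; a destroy attempt simply fails at the API.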
I can't think of any specific example where I would let any agent touch a production environment, least of all production data. AI aside, it makes sense to do any major changes in a dev/staging/preview environment first.
Not really sure what the lesson would be here. Don't punch yourself in the face repeatedly?
I publish a weekly newsletter where I share practical insights on data and AI.
It focuses on projects I'm working on + interesting tools and resources I've recently tried: https://alexeyondata.substack.com
It's hard to take the author seriously when this immediately follows the post. I can only conclude that this post was for the views, not anything to learn from or be concerned about.
> If you found this post helpful, follow me for more content like this.
> I publish a weekly newsletter where I share practical insights on data and AI.
> I forgot to use the state file, as it was on my old computer
indicates that this person did not really know what they were doing in the first place. I honestly think using an LLM to do the Terraform setup in the first place would probably have led to better outcomes.

- Don't even let dev machines access the infra directly (unless you're super early in a greenfield project): no local deploys, no SSH. Everything should go through either the pipeline or tools.
Why?
- The moment you "need" to do one of these, you've discovered a use case that will most likely repeat.
- By letting every dev rediscover this use case, you'll end up with hidden knowledge and a multitude of solutions.
In conversation fragments:
- "... let me just quickly check if there's still enough disk space on the instance"
- "Hey Kat, could you get me the numbers again? I need them for a report." "sure, I'll run my script and send them to you in slack" "ah.. Could you also get them for last quarter? They're not in slack anymore"
> CRITICAL: Everything was destroyed. Your production database is GONE. Let me check if there are any backups:
> ...
> No snapshots found. The database is completely lost.
I'm no AI advocate, but I have been using it for six months now. It's a very powerful tool, and powerful tools need to be respected. Clearly this guy has no respect for his infrastructure.
The screenshot he posted, "Let me check if there are backups", is a typical example of how lazy people use AI.
> Make no backups
> Hand off all power to AI
> Post about it on twitter
> "Teaching engineers to build production AI systems"
This has to be ragebait to promote his course, no?
>Teaching engineers to build production AI systems
>100,000+ learners
He had a state file somewhere that was aligned to his current infrastructure... why this isn't in a remote backend, who really knows...
He then ran Terraform without that state file and then ran a terraform apply... whatever could get created would get created, and whatever conflicted with a resource that already existed would fail. Moreover, he could've just run terraform destroy after letting it finish, which would've been a much cleaner way to clean up after himself.
Except... he canceled the terraform apply... saw that it had created resources, and then tried to guess which resources those were...
I'm sorry, he could've done all of this by himself without any agentic AI. It's 100% PICNIC (problem in chair, not in computer).
Dear lord, imagine this guy teaching you how to build anything in production...
The productivity gains from AI agents are real, but only if you invest in the boring part first — deterministic boundaries that don't depend on the model being smart enough to not break things.
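One concrete shape such a deterministic boundary can take, assuming AWS: run the agent under credentials that simply cannot delete anything. A hedged sketch of a read-only IAM policy in Terraform (policy name and action list are illustrative):

```hcl
resource "aws_iam_policy" "agent_readonly" {
  name = "agent-readonly"

  policy = jsonencode({
    Version = "2012-10-17"
    # Only Describe/Get/List actions - no Delete*, no Modify*:
    # destructive API calls fail regardless of what the model decides to try.
    Statement = [{
      Effect   = "Allow"
      Action   = ["rds:Describe*", "ec2:Describe*", "s3:Get*", "s3:List*"]
      Resource = "*"
    }]
  })
}
```

Attach that policy to whatever role the agent's credentials assume, and the "boring part" is done once instead of re-litigated per prompt.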
I am still heavily checking everything they’re doing. I can’t get behind others letting them run freely in loops, maybe I’m “behind”.
It has never been the intern's fault, it's always the lack of proper authorization mechanisms, privilege management and safeguards.
YOU wiped your production database.
YOU failed to have adequate backups.
YOU put Claude Code forward as responsible, but it’s just a tool.
YOU are responsible, not “the AI did it!”
Under no circumstances should you even let an AI agent near a production system.
Absolutely irresponsible.
I don’t think AI is to blame here.
“but wont it break prod how can i tell”
“i don want yiu to modify it yet make a backup”
“why did you do it????? undo undo”
“read the file…later i will ask you questions”
Every single story I see has the same issues.
They’re token prediction models trying to predict the next word based on a context window full of structured code and a 13-year-old girl texting her boyfriend. I really thought people understood what “language models” are really doing, at least at a very high level, and would know to structure their prompts based on the style of the training content they want the LLM to emulate.
Sure, Claude could just remove the lock - but it's one more gate.
Edit: these existed long before agents, and for good reason: mistakes happen. Last week I removed tf destroy from a GitHub workflow, because it was 16px away from apply in the dropdown. Lock your dbs, irrespective of your take on agents.
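For readers who haven't seen the gate in question, this is presumably Terraform's `prevent_destroy` lifecycle flag; a sketch (resource name invented):

```hcl
resource "aws_db_instance" "prod" {
  # ... instance configuration ...

  lifecycle {
    # Any plan that would destroy this resource errors out at plan time.
    # An agent (or tired human) has to deliberately edit this line out first.
    prevent_destroy = true
  }
}
```

As the parent says, it's removable - but removal is an explicit, reviewable code change rather than a side effect of one careless apply.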
Good thing the guy is his own boss. I would've fired his ass immediately and sued for damages as well. This is 100% negligent behavior.
Always forward evolve infra. Terraform apply to add infra, then remove the definition and terraform apply to destroy it. There’s no use in running terraform destroy directly on a routine basis.
Also, I assume the RDS snapshots were defined in the same state? That is clearly erroneous: it means a malformed apply, human or agent, can result in snapshot deletion.
The use of terraform destroy is a footgun waiting for a tired human to destroy things. The lesson has nothing to do with the agent.
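The forward-evolution workflow above amounts to: delete the block, review the plan, apply. A sketch (resource and AMI are invented):

```hcl
# A resource exists because its definition exists:
resource "aws_instance" "batch_runner" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "t3.micro"
}

# To retire it: delete the block above, run `terraform plan` and confirm
# the plan shows exactly one destroy, then `terraform apply`. Compare with
# `terraform destroy`, which proposes to level everything in the state.
```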
> In the newsletter, I wrote the full timeline + what I changed so this doesn't happen again.
> If you found this post helpful, follow me for more content like this.
So yeah, this is standard LinkedIn/X influencer slop.
They’re doing it to try and stop people copying their methods, but it’s evil.
Terraform is a ticking time bomb. All it takes is for a new field to show up in AWS or a new state in an existing field, and now your resource is not modified, but is destroyed and re-created.
I will never trust any process, AI or a CD pipeline, to execute `terraform apply` automatically on anything production. Maybe if you examine the plan for a very narrow set of changes and then execute apply from that saved plan only, maybe then you can automate it. I think it’s much rarer for Terraform to deviate from a plan.
Regardless, you must always turn on Delete Protection on all your important resources. It is wild to me that AWS didn’t ship EKS with delete protection out of the gate — they only added this feature in August 2025! Not long before that, I witnessed a production database get deleted because Terraform decided that an AWS EKS cluster could not be modified, so it decided to delete and re-create it, while the team was trying to upgrade the version of EKS. The exact same pipeline worked fine in the staging environment. Turns out production had a slight difference due to AWS API changes, and Terraform decided it could not modify in place.
The use of a state file with Terraform is a constant source of trouble and footguns:
- you must never use a local Terraform state file for production that’s not committed to source control
- you must use a remote S3 state file with Terraform for any production system that’s worth anything
- ideally, the only state file in source control is for a separate Terraform stack to bootstrap the S3 bucket for all other Terraform stacks
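The remote-state setup in the first two points is a few lines of backend config (bucket and table names invented; the bucket itself comes from the separate bootstrap stack described above):

```hcl
terraform {
  backend "s3" {
    bucket         = "acme-terraform-state"  # created by the bootstrap stack
    key            = "prod/network.tfstate"
    region         = "us-east-1"
    encrypt        = true
    dynamodb_table = "terraform-locks"       # state locking: concurrent applies fail fast
  }
}
```

With this in place, "the state file was on my old computer" stops being a possible sentence.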
If you’re running only on AWS, and are using agents to write your IaC anyway, use AWS CloudFormation, because it doesn’t use state files, and you don’t need your IaC code to be readable or comprehensible.
Still, even if in ten years I am on the streets, I will have spared myself whatever this hell is... I know they deserve it, but I still feel bad for the humans at the center here. How can we blame people, really, when the whole world and their bosses are telling them it's OK? Surely it's a lot of young devs here too... Such a terrible intro to the industry. Not sure I'd ever recover personally.
The more you fuck around, the more you find out.
I rarely say this, but there needs to be a new term or concept for an AI staging environment. There's Prod <- QA <- Dev, and maybe even before Dev there should be an environment called "AI" or even "Slop".