FRESH

Hacker News

Home

I dropped our production database and now pay 10% more for AWS

54 points by dsr12

by grizmaldi

3 subcomments

"Instead of going through the plan manually, I let Claude Code run terraform plan and then terraform apply".
Doesn't matter if it was you or the bot running terraform, the whole point of a two-step process is to confirm the plan looks right before executing the apply. Looking at the plan after the apply is already running is insane.

by oneneptune

2 subcomments

I think people will be quick to engage with the "ai is risky" angle, but the thing that jumps out to me is that you were working against a production state in the first place.
The agent made a mistake that plenty of humans have made. A separate staging environment on real infrastructure goes a long way. Test and document your command sequence / rollout plan there before running it against production. Especially for any project with meaningful data or users.

by fny

2 subcomments

Even though a lot of what people with agents is wreckless, they often build their own guillotine in the process too.
Problem #1: He decided to shoehorn two projects into 1 even though Claude told him not to.
Problem #2: Claude started creating a bunch of unnecessary resources because another archive was unpacked. Instead of investigating this despite his "terror" the author let Claude continue and did not investigate.
Problem #3: He approved "terraform destroy" which obviously nukes the DB! It's clear he didn't understand, and he didn't even have a backup!
> That looked logical: if Terraform created the resources, Terraform should remove them. So I didn’t stop the agent from running terraform destroy

by otterley

1 subcomments

The author is extremely lucky that support was able to find a snapshot for him after he deleted them all. I worked for AWS for many years and was a customer for years before that, and they were almost never able to recover deleted customer data. And this is on purpose: when a customer asks AWS to delete data, they want to assure the customer that it is, in fact, gone. That’s a security promise.
So the fact that they were able to do it for the author is both winning the lottery and frankly a little concerning.
What bothers me more is that the Terraform provider is deleting snapshots that are related to, but not, the database resource itself. Once a snapshot is made, that’s supposed to be decoupled from the database for infrastructure management purposes. That needs to be addressed IMO.
UPDATE: deleting previous automated snapshots on database instance or cluster deletion is default behavior in RDS; that’s not the TF provider’s fault. However, default RDS behavior on deletion is to create a final snapshot of the DB. Makes me wonder if that’s what support helped the author recover. If so, the author didn’t technically need support other than to help locate that snapshot.
And yes this is an object lesson of why human-in-the-loop is still very much needed to check the work of agents that can perform destructive actions.

by kami23

0 subcomment

Props to sharing this!
> Claude was trying to talk me out of it, saying I should keep it separate, but I wanted to save a bit because I have this setup where everything is inside a Virtual Private Cloud (VPC) with all resources in a private network, a bastion for hosting machines
I will admit that I've also ignored Claude's very good suggestions in the past and it has bitten me in the butt.
Ultimately with great automation becomes a greater risk of doing the worst thing possible even faster.
Just thinking about this specific problem makes me more keen to recommend that people have backups and their production data on two different access keys for terraform setups.
I'm not sure how difficult that is I haven't touched terraform in about 7 years now, wow how time flies.

by wackget

4 subcomments

How many users does this website have? It must be relatively tiny.
Why the hell is this anywhere near AWS, or Terraform, or any other PaaS nonsense? I'd wager this thing could be run off a $5 VPS with 30 minutes of setup.

by ugiox

0 subcomment

First let the agent do everything and wrong. But why then continue to use the agent to analyze the problem? That would have been the time to stop using Claude.
And why use an agent at all? For some IaC terraform runs?
What is the problem nowadays that people rather prefer to use non-deterministic actions from an agent instead of the very deterministic cli invocations needed?
I guess these people don’t deserve better. Darwin Award winners.

by sealthedeal

0 subcomment

You should never let Claude manage data in this way. You should if anything have Claude come up with a plan that you manually execute. I get why you would go this path but its pure laziness, and in any normal environment where you weren't the owner you would be terminated and potentially sued for negligence.

by yomismoaqui

1 subcomments

Oh, the missing Terraform state file.
I haven't used Terraform in anger, but when I experimented with it I was scared about the scenario that happened to the original poster.
I thought "it's a footgun but sure I will not execute commands blindly like that", but in the world of clankers seems like this can happen easily.

by Ciantic

1 subcomments

I've used Claude and AWS CDK to build infra code during past year, it is great help but it is not to be trusted. I would not even consider it for Ralph Wiggum Loop style iteration or let alone allowing it to run `cdk deploy` or `cdk destroy`. It can generate decent looking constructs, but it comes up values for you like serverlessV2MinCapacity or sometimes it creates resources I don't need. It can end up costing a lot if you then deploy something you didn't expect to.
Since running destroy and deploy also takes a long time, gets stuck, throws weird errors etc, one still needs to read the docs for many things and understand the constructs it outputs.

by jopsen

0 subcomment

There will probably be some yolo startups that deploy write-only code to production with unreviewed terraform plans -- who knows this could be disruptive -- but I'm also certain this won't be the last such story.
---
All that being said: it's kind of sad because terraform is fairly declarative and the plans are fairly high-level.
Hence, terraform files and plans are the stuff you should review.
Where as a bunch of imperative code implementing CRUD with fancy UI might be the kind of spaghetti code that's hard to review.

by jpalmer

2 subcomments

If you delete all your backups, AWS maintains shadow backups they can restore? Is that right?

by nusl

0 subcomment

Bit of a story of negligence, ignorance, and laziness. I can't say I have much of any sympathy. There were multiple steps that they could have intervened and chose not to.
Good story of what not to do though

by testplzignore

0 subcomment

You should always use Object Lock with compliance mode on your S3 backups. Always.

by etothet

0 subcomment

Also why regular backups are necessary. Glad they helped in this case.
With great power…

by 01284a7e

0 subcomment

I'm cool with blogging about your fuck-ups, but honestly, not really. Is "I'm incompetent" a good content strategy? Your product is a thousand bucks a year. I'm not going near it. But that's just me?

by nozzlegear

0 subcomment

No consequences for Claude, only consequences for the human who put their faith in it.

by jvolkman

0 subcomment

Back in my day, we didn't need AI to accidentally drop production databases.

by Imustaskforhelp

1 subcomments

One of the largest things i am learning from these stories (Tangentially Wikipedia story too) is to have backups outside of your own infrastructure with snapshots at 15 minute recovery time preferably when possible
For context it was 2.5 years of data. I can only just imagine the nightmare if things would've turned out even a tiny bit more worse for ya. The nightmare it would've been if snapshot of the production database wouldn't have been found even within the AWS business support.
> I was overly reliant on my Claude Code agent, which accidentally wiped all production infrastructure for the DataTalks.Club course management platform that stored data for 2.5 years of all submissions: homework, projects, leaderboard entries, for every course run through the platform.

by gorfian_robot

0 subcomment

this seems like a lot of moving parts for 2M rows ....

by x3n0ph3n3

0 subcomment

There are so many mistakes being made here:
- Not using remote state management (setting up an S3 backend is easy and you're already in AWS!)
- Allowing an AI agent to execute against your production environment (especially with no guardrails)
- Not confirming the plan (which I _could_ excuse if one's pipeline is mature enough)
- Not confirming the resources Claude identified automatically before letting it delete things
- Combining 2 projects into the same state.
These mistakes are so horribly egregious that I feel second-hand embarrassment.

by HackerThemAll

0 subcomment

Again the same crying dev baby that did not make backups, blaming AI on the issue. Idiocracy is happening right before our eyes.

by UltraSane

0 subcomment

I'm amazed at how some people are willing to tell the world about making incredibly stupid mistakes like this. The user he was using should NOT have had delete permissions.

by octoclaw

0 subcomment

[dead]

by gneray

6 subcomments

Everyone here firing shots at this guy should try holding their tongues.
You/we are all susceptible to this sort of thing, and I call BS on anyone who says they check every little thing their agent does with the same level of scrutiny as they would if they were doing it manually.