FRESH

Hacker News

Home

Infrastructure decisions I endorse or regret after 4 years at a startup (2024)

495 points by Meetvelde

by Hovertruck

1 subcomments

Much of this matches my own experience. A few thoughts:
1. Cost tracking meetings with your finance team are useful, but for AWS and other services that support it I highly recommend setting billing alarms. The sooner you can know about runaway costs, the sooner you can do something about it.
2. Highly recommend PGAnalyze (https://pganalyze.com/) if you're running Postgres in your stack. It's really intuitive, and has proven itself invaluable many times when debugging issues.
3. Having used Notion for like 7 years now, I don't think I love it as much as I used to. I feel like the "complexity" of documents gets inflated by Notion and the number of tools it gives you, and the experience of just writing text in Notion isn't super smooth IMO.
4. +1 to moving off JIRA. We moved to Shortcut years ago, I know Linear is the new hotness now.
5. I would put Datadog as an "endorse". It's certainly expensive but I feel we get loads of value out of it since we leaned so heavily into it as a central platform.

by econner

6 subcomments

It's weird that one of the reasons that you endorse AWS is that you had regular meetings with your account manager but then you regret premium support which is the whole reason you had regular meetings with your account manager.

by kstrauser

5 subcomments

> Picking Terraform over Cloudformation: Endorse
I, too, prefer McDonald's cheeseburgers to ground glass mixed with rusty nails. It's not so much that I love Terraform (spelled OpenTofu) as that it's far and away the least bad tool I've used in the space.

by SoftTalker

1 subcomments

After listing dozens of infrastructure products/projects, "My general infrastructure advice is “less is better”.
That made me laugh. Yes I get that they probably didn't use all of these at the same time.

by calmbonsai

3 subcomments

This is the best post to HN in quite some time. Kudos to the detailed and structured break-down.
If the author had a Ko-Fi they would've just earned $50 USD from me.
I've been thinking of making the leap away from JIRA and I concur on RDS, Terraform for IAC, and FaaS whenever possible. Google support is non-existent and I only recommend GC for pure compute. I hear good things about Big Table, but I've never used in in production.
I disagree on Slack usage aside from the postmortem automation. Slack is just gonna' be messy no matter what policies are put in place.

by spprashant

0 subcomment

I have seen this post before. I don't know what exactly they do, but that's an extraordinary list of products to be managing. I hope they are making enough revenue to cover those outrageous costs.

by rf15

6 subcomments

> Startups don’t have the luxury of a DBA
but... you are spending so much on AWS and premium support... surely you can afford that

by rco8786

2 subcomments

I think we're making a mistake by shoving all of this into the cloud rather than building tooling around local agents (worktrees, containers, as mentioned as "difficult" in the post). I think as an industry we just reach for cloud like our predecessors reached for IBM, without critical thought about what's actually the right tool for the job.
If you can manage docker containers in a cloud, you can manage them on your local. Plus you get direct access to your own containers, local filesystems and persistence, locally running processes, quick access for making environmental tweaks or manual changes in tandem with your agents, etc. Not to mention the cost savings.

by kolja005

2 subcomments

>Since the database is used by everyone, it becomes cared for by no one. Startups don’t have the luxury of a DBA, and everything owned by no one is owned by infrastructure eventually.
This post was a great read.
Tangent to this, I've always found "best practices" to be a bit of a misnomer. In most cases in software and especially devops I have found it means "pay for this product that constrains the way that you do things so you don't shoot yourself in the foot". It's not really a "practice" if you're using a product that gives you one way to do something. That said my company uses a very similar tech stack and I would choose the same one if I was starting a company tomorrow, despite the fact that, as others have mentioned, it's a ton to keep in your head all at once.

by yakkomajuri

1 subcomments

> "Like most tech debt, we didn’t make this decision, we just did not not make this decision."
This is an important point.

by indiestack

0 subcomment

The SQLite-per-customer pattern mentioned in the database subthread is underrated. I've been running a FastAPI app with a single SQLite database (WAL mode + FTS5) and the operational simplicity is genuinely life-changing compared to managing Postgres.
The key insight: for read-heavy workloads on a single machine, SQLite eliminates the network hop entirely. Response times drop to sub-15ms for full-text search queries. The tradeoff is write concurrency, but if your write volume is low (mine is ~20/day), it's a non-issue.
The one thing I'd add to the article: the biggest infrastructure regret I see is premature complexity. Running Postgres + Redis + a message queue when your app gets 100 requests/day is solving problems you don't have while creating problems you do (operational overhead, debugging distributed state, config drift between environments).

by joshdick

0 subcomment

> everything owned by no one is owned by infrastructure eventually
Whoa, now there is a truth bomb. I've seen this happen a bunch, but never put it this succinctly before.

by consumer451

8 subcomments

> Multiple applications sharing a database [0]
> Regret
Thanks for this data point. I am currently trying to make this call, and I was still on the fence. This has tipped me to the separate db side.
Can anyone else share their experience with this decision?
[0] https://cep.dev/posts/every-infrastructure-decision-i-endors...

by ungreased0675

0 subcomment

Is it common practice for a startup to have ALL of those SaaS subscriptions?
It seems excessive and expensive. Is this what most startups are doing these days?

by artyom

1 subcomments

> Multiple applications sharing a database
This is a classic. I'd say that for every company, big or small, ends up taking the #1 spot on technical debt.

by esoterae

0 subcomment

Honestly, this is a reasonable itemization of experience with individual tools, but this reads like a recipe for Company Cake instead of a case-by-case statement of need, selection, and then evaluation. Cargo culting continues to wrap its tendrils around the industry and try to drag it into the depths of mediocrity, and this largely reads to me like a primer for how to saddle yourself with endless SaaS bills. I recognize that every situation has its nuances, but I think approaching running a company from "what tools do you use" is pretty much the biggest possible example of ignoring that maxim.

by sylens

0 subcomment

The part about account teams for AWS and GCP is very true in my experience. I could tell my AWS account team that I was hungry and they would offer to bring me a bagel in an hour. My GCP account team no-shows our cadence calls and somehow forgets the one question I ask them in the intervening time between our calls, which means each month I get to re-explain the issue as they pretend to escalate it again.

by kaycey2022

1 subcomments

Feels like a minor glimpse into what's involved in running tech companies these days. Sure this list could be much simpler, but then so would the scope of the company's offerings. So AI would offer enough accountability to replace all of this? Agents juggling million token contexts? It's kind of hard to wrap my head around.

by mettamage

4 subcomments

As a non infra guy I'll say this. I'm curious about Linear. At my own company I vibecoded my own project management app against the JIRA API because I can't stand our version of JIRA. It's too many clicks, too many things to remember and it's unintuitive.

by wavemode

2 subcomments

(2024)
past discussion: https://news.ycombinator.com/item?id=39313623

by stego-tech

0 subcomment

Love it. Excellent reasoning for subjective decisions that don’t knock the product or solution itself as much as, “not what we specifically needed, and that’s okay”.
Bookmarked for my own infrastructure transformations. Honestly, if Okta could spit out a container or appliance that replaces on-prem ADDCs for LDAP, GPOs, and Kerberos, I’d give them all the money. They’re just so good.

by nevalainen

0 subcomment

There is a lot of "stuff" for liking to keep it simple. Great article though!

by neo_doom

1 subcomments

> Regret: Not adopting an identity platform early on. I stuck with Google Workspace at the start...
I've worked with hundreds of customers to integrate IdP's with our application and Google Workspace was by far the worst of the big players (Entra ID, Okta, Ping). Its extremely inflexible for even the most basic SAML configuration. Stay far, far away.

by mwcampbell

1 subcomments

I disagree on Kubernetes versus ECS. For me, the reasons to use ECS are not having to pay for a control plane, and not having to keep up with the Kubernetes upgrade treadmill.

by bigiain

0 subcomment

I initially read this wrong as "Almost every infrastructure decision I make I regret after 4 years", and I nodded my head in agreement.
I've been working mostly at startups most of my career (for Sydney Australia values of "start up" which mostly means "small and new or new-ish business using technology", not the Silicon Valley VC money powered moonshot crapshoot meaning). Two of those roles (including the one I'm in now) have been longer that a decade.
And it's pretty much true that almost all infrastructure (and architecture) decisions are things that 4-5 years later become regrets. Some standouts from 30 years:
I didn't choose Macromind/Macromedia Director in '94 but that was someone else's decision I regretted 5 years later.
I shouldn't have chosen to run a web business on ISP web hosting and Perl4 in '95 (yay /cgi-bin).
I shouldn't have chosen globally colocated desktop pc linux machines and MySQL in '98/99 (although I got a lot of work trips and airline miles out of that).
I shouldn't have chosen Python2 in 2007, or even worse Angular2 in 2011.
I _probably_ shouldn't have chosen Arch Linux (and a custom/bastardised Pacman repo) for a hardware startup in 2013.
I didn't choose Groovy on Grails in 2014 but I regretted being recruited into being responsible for it by 2018 or so.
I shouldn't have chosen Java/MySQL in 2019 (or at least I should have kept a much tighter leash on the backend team and their enterprise architecture astronaut).
The other perspective on all those decisions though, each of them allowed a business to do the things they needed to take money off customers (I know I know, that's not the VC startup way...) Although I regretted each of those later, even in retrospect I think I made decent pragmatic choices at the time. And at this stage of my career I've become happy enough knowing that every decision is probably going to have regrets over a 4 or 5 year timeframe, but that most projects never last long enough for you to get there - either the business doesn't pass out and closes the project down, or a major ground up rewrite happens for reasons often unrelated to 5 year old infrastructure or architecture choices.

by stroebs

0 subcomment

The Bottlerocket issues really surprise me - not an experience I've shared even with heavy use. I use EKS with Bottlerocket + managed addons + Karpenter, and our security team is super happy that _nobody_ has access to the underlying nodes. Immutable OS is a key selling point, and Brupop "just works" to keep everything up to date without any input. Patching nodes is something I haven't had to think about in almost a year.

by arush15june

0 subcomment

I'll switch in Cloudflare Zero Trust for Okta simply for the fact that Cloudflare Access and Tunnels + An identity provider (we use M365) give you so much value (and it's free upto 50 users). It is even better if you are already running DNS on Cloudflare, you can securely deploy access-controlled apps on the internet without too much of a hassle and management. And with the recent addition for Infrastructure for SSH you can securely extend SSH access just as seamlessly.

by lightyrs

1 subcomments

Interested to know what's changed (if anything) in the two years since this was written.

by Grimburger

0 subcomment

> There are no great FaaS options for running GPU workloads
Knative on k8s works well for us, there's some oddities about it but in general does the job

by zem

5 subcomments

I would love to read more about the pros and cons of using a single database, if anyone has pointers to articles

by jbmsf

0 subcomment

Thanks. I've been meaning to write one of these for a long time, but you went into detail in a very effective, organized way.
I also reached a lot of similar decisions and challenges, even where we differ (ECS vs EKS) I completely understand your conclusions.

by jmward01

0 subcomment

You will never agree 100% with someone else when it comes to decisions like this, but clearly there is a lot of history behind these decisions and they are a great starting point for conversations internally I think.

by bob1029

2 subcomments

> Not using Function as a Service(FaaS) more
FaaS is almost certainly a mistake. I get the appeal from an accountant's perspective, but from a debugging and development perspective it's really fucking awful compared to using a traditional VM. Getting at logs in something like azure functions is a great example of this.
I pushed really hard for FaaS until I had to support it. It's the worst kind of trap. I still get sweaty thinking about some of the issues we had with it.

by bfeynman

0 subcomment

All in on AWS and using GitOps with TF instead of much more feature rich CDK...

by thundergolfer

0 subcomment

> There are no great FaaS options for running GPU workloads
modal.com exists now

by prplfsh

0 subcomment

I feel so many of these. LOL @ GitHub endorse-ish, more -ish every day now. Overall though seems like a pretty good hit rate.
Surprised to see datadog as a regret - it is expensive but it's been enormously useful for us. Though we don't run kubernetes, so perhaps my baseline of expensive is wrong.

by jrjeksjd8d

5 subcomments

I see you regret Datadog but there's no alternative - did you end up homebrewing metrics, or are you just living with their insane pricing model? In my experience they suck but not enough to leave.

by robszumski

0 subcomment

Thanks for sharing, really helpful to see your thinking. I haven't fully embraced FaaS myself but never regretted it either.
Curious to hear more about Renovate vs Dependabot. Is it complicated to debug _why_ it's making a choice to upgrade from A to B? Working on a tool to do app-specific breaking change analysis so winning trust and being transparent about what is happening is top of mind.
When were you using quay.io? In the pre-CoreOS years, CoreOS years (2014-2018), or the Red Hat years?

by shockwaverider

0 subcomment

I’m CTO for a startup that was recently acquired for $100M+ I agree with everything in this post apart from Go because I’m just not a big fan of the language.

by findalex

0 subcomment

My measurement of quality going in was how far I need to scroll to see EKS. Very high quality.

by herpdyderp

0 subcomment

Does anyone have thoughts on how bunny.net compares to Netlify? I've been getting sick of them (and their pricing) recently...

by dwedge

0 subcomment

This just feels like an article by someone who took an AWS course and promotes the company line

by ttoinou

0 subcomment

Nice but how do those services combine with each others ? How do you combine notion, slack, your git hosting, linear and your CI/CD ? If there are only URLs between each others it’s hard to link all the work together

by ink_13

0 subcomment

(2024)
Just FYI article is two years old

by isoprophlex

0 subcomment

> There are no great FaaS options for running GPU workloads
I love modal. I think they got FaaS for GPU exactly right, both in terms of their SDK and the abstractions/infra they provide.

by weedhopper

0 subcomment

Great post. I even wouldn’t mind more details, especially about datadog, or as others pointed out, the kind of contradiction with aws support.

by 0xbadcafebee

1 subcomments

Using GCP gives me the same feeling as vibe-coded source code. Technically works but deeply unsettling. Unless GCP is somehow saving you boatloads of cash, AWS is much better.
RDS is a very quick way to expand your bill, followed by EC2, followed by S3. RDS for production is great, but you should avoid the bizarre HN trope of "Postgres for everything" with RDS. It makes your database unnecessarily larger which expands your bill. Use it strategically and your cost will remain low while also being very stable and easy to manage. You may still end up DIYing backups. Aurora Serverless v2 is another useful way to reduce bill. If you want to do custom fancy SQL/host/volume things, RDS Custom may enable it.
I'm starting to think Elasticache is a code smell. I see teams adopt it when they literally don't know why they're using it. Similar to the "Postgres for everything" people, they're often wasteful, causing extra cost and introducing more complexity for no benefit. If you decide to use Elasticache, Valkey Serverless is the cheapest option.
Always use ECR in AWS. Even if you have some enterprise artifact manager with container support... run your prod container pulls with ECR. Do not enable container scanning, it just increases your bill, nobody ever looks at the scan results.
I no longer endorse using GitHub Actions except for non-business-critical stuff. I was bullish early on with their Actions ecosystem, but the whole thing is a mess now, from the UX to the docs to the features and stability. I use it for my OSS projects but that's it. Most managed CI/CD sucks. Use Drone.io for free if you're small, use WoodpeckerCI otherwise.
Buying an IP block is a complicated and fraught thing (it may not seem like it, but eventually it is). Buy reserved IPs from AWS, keep them as long as you want, you never have to deal with strange outages from an RIR not getting the correct contact updated in the correct amount of time or some foolishness.
He mentions K8s, and it really is useful, but as a staging and dev environment. For production you run into the risk of insane complexity exploding, and the constant death march of upgrades and compatibility issues from the 12 month EOL; I would not recommend even managed K8s for prod. But for staging/dev, it's fantastic. Give your devs their own namespace (or virtual cluster, ideally) and they can go hog wild deploying infrastructure and testing apps in a protected private environment. You can spin up and down things much easier than typical AWS infra (no need for terraform, just use Helm) with less risk, and with horizontal autoscaling that means it's easier to save money. Compare to the difficulty of least-privilege in AWS IAM to allow experiments; you're constantly risking blowing up real infra.
Helm is a perfectly acceptable way to quickly install K8s components, big libraries of apps out there on https://artifacthub.io/. A big advantage is its atomic rollouts which makes simple deploy/rollback a breeze. But ExternalSecrets is one of the most over-complicated annoying garbage projects I've ever dealt with. It's useful, but I will fight hard to avoid it in future. There are multiple ways to use it with arcane syntax, yet it actually lacks some useful functionality. I spent way too much time trying to get it to do some basic things, and troubleshooting it is difficult. Beware.
I don't see a lot of architectural advice, which is strange. You should start your startup out using all the AWS well-architected framework that could possibly apply to your current startup. That means things like 1) multiple AWS accounts (the more the better) with a management account & security account, 2) identity center SSO, no IAM users for humans, 3) reserved CIDRs for VPCs, 4) transit gateway between accounts, 5) hard-split between stage & prod, 6) openvpn or wireguard proxy on each VPC to get into private networks, 7) tagging and naming standards and everything you build gets the tags, 8) put in management account policies and cloudtrail to enforce limitations on all the accounts, to do things like add default protections and auditing. If you're thinking "well my startup doesn't need that" - only if your startup dies will you not need it, and it will be an absolute nightmare to do it later (ever changed the wheels on a moving bus before?). And if you plan on working for more than one startup in your life, doing it once early on means it's easier the second time. Finally if you think "well that will take too long!", we have AI now, just ask it to do the thing and it'll do it for you.

by dangoodmanUT

0 subcomment

> There are no great FaaS options for running GPU workloads, which is why we could never go fully FaaS.
modal.com???

by gnarbarian

0 subcomment

given enough time you may regret every single one of them.

by mlrtime

2 subcomments

Pagerduty: They haven't yet hit that point where PD doubles the prices for them. Or they don't have everyone on the platform, it will be their next Datadog (too expensive)

by AIorNot

1 subcomments

Its insane how many SaaS solutions are needed piecemeal to run a company these days - just listing everything out like that made it apparent

by mads_quist

0 subcomment

Jesus, this piece reads like it's from 2024. Oh, it's from 2024.

by rixed

1 subcomments

Sure, let's take advices about infrastructure from that guy wo needs a tool to automate postmortems.

by piokoch

1 subcomments

I've just look out of curiosity on Appsmith, as the author endorsed this tool as some admin panel builder. I had to double check the name, as right now this is, surprise, surprise, AI powered application builder...
I used to use Replit for educational purposes, to be able to create simple programs in any language and share them with others (teachers, students). That was really useful.
Now Replit is a frontend to some AI chat that is supposed to write software for me.
Is this jumping into AI bandwagon everywhere a new trend? Is this really needed? Is this really profitable?

by themafia

1 subcomments

> “This EC2 instance type running 24/7 at full load is way less expensive than a Lambda running”.
For the same amount of memory they should cost _nearly_ identical. Run the numbers. They're not significantly different services. Aside from this you do NOT pay for IPv4 when using Lambda, you do on EC2, and so Lambda is almost always less expensive.

by rewilder12

0 subcomment

lol the ingress-nginx endorsement aged well

by gib444

2 subcomments

Infra guys doing DBA is a nightmare in my experience (usually clueless and it gets loved less than more sexy parts of infra). Devs too
Hire a DBA ASAP. They need to reign in also the laziness of all other developers when designing and interacting with the DB. The horrors a dev can create in the DB can take years to undo

by MaXtreeM

0 subcomment

Previous discussion (626 comments): https://news.ycombinator.com/item?id=39313623

by Adexintart

1 subcomments

[flagged]