That may be. What's not specified there is the immense, immense cost of driving a dev org on those terms. It limits, radically, the percent of engineers you can hire (to those who understand this and are willing to work this way), and it slows deployment radically.
Cloudflare may well need to transition to this sort of engineering culture, but there is no doubt that they would not be in the position they are in if they started with this culture -- they would have been too slow to capture the market.
I think critiques that have actionable plans for real dev teams are likely to be more useful than what, to me, reads as a sort of complaint from an ivory tower. Culture matters, shipping speed matters, quality matters, team DNA matters. That's what makes this stuff hard (and interesting!)
I'm completely mystified how the author concludes that the switch from PostgreSQL to ClickHouse shows the root of this problem.
1. If the point is that PostgreSQL is somehow less error-prone, it's not in this case. You can make the same mistake if you leave off the table_schema filter in information_schema.columns queries.
2. If the point is that Cloudflare should have somehow discovered this error through normalization and/or formal methods, perhaps he could demonstrate exactly how this would have (a) worked, (b) been less costly than finding and fixing the query through a better review process or testing, and (c) avoided generating other errors as a side effect.
I'm particularly mystified how lack of normalization is at fault. ClickHouse system.columns is normalized. And if you normalized the query result to remove duplicates that would just result in other kinds of bugs as in 2c above.
Edit: fix typo
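To make the point concrete, here's a minimal sketch. It simulates a metadata catalog (like information_schema.columns or ClickHouse's system.columns) with an in-memory sqlite table; the schema and table names are invented for illustration. The same table name existing under two schemas doubles the result of an unfiltered query, in either database.

```python
import sqlite3

# Hypothetical illustration: simulate a metadata catalog where the same
# table name exists under two schemas (e.g. "default" and a replica "r0").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE columns (table_schema TEXT, table_name TEXT, column_name TEXT)")
conn.executemany("INSERT INTO columns VALUES (?, ?, ?)", [
    ("default", "http_requests", "timestamp"),
    ("default", "http_requests", "bot_score"),
    ("r0",      "http_requests", "timestamp"),  # same table, other schema
    ("r0",      "http_requests", "bot_score"),
])

# WITHOUT the schema filter: every column appears twice.
no_filter = conn.execute(
    "SELECT column_name FROM columns WHERE table_name = 'http_requests'"
).fetchall()

# WITH the schema filter: the expected column list.
with_filter = conn.execute(
    "SELECT column_name FROM columns "
    "WHERE table_schema = 'default' AND table_name = 'http_requests'"
).fetchall()

print(len(no_filter), len(with_filter))  # 4 2
```

The mistake is in the query, not in which catalog you query it against.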
* The deployment should have followed the blue/green pattern, limiting the blast radius of a bad change to a subset of nodes.
* In general, a company so much at the foundational level of internet connectivity should not follow the "move fast, break things" pattern. They did not have an overwhelming reason to hurry and take risks. This has burned a lot of trust, no matter the nature of the actual bug.
You don't know the context, you don't know _anything_ except for what Cloudflare chooses to share.
There are very few companies who deal with the kind of load that Cloudflare does. I dread to think what weird edge cases they've run into because of their sheer scale.
In a database, you wouldn't solve this with a DISTINCT or a LIMIT. You would make the schema guarantee uniqueness.
And yes, that wouldn't deal with cross-database queries. But the solution there is just to filter by db name; the rest is table design.
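A minimal sketch of "make the schema guarantee uniqueness", again using sqlite as a stand-in (the table and column names are made up): with a UNIQUE constraint, the duplicate insert fails loudly instead of a missing DISTINCT silently doubling the result set.

```python
import sqlite3

# Hypothetical sketch: let the schema, not each query, guarantee uniqueness.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feature_columns (
        table_name  TEXT NOT NULL,
        column_name TEXT NOT NULL,
        UNIQUE (table_name, column_name)
    )
""")
conn.execute("INSERT INTO feature_columns VALUES ('http_requests', 'bot_score')")
try:
    # The second, identical row is rejected at write time.
    conn.execute("INSERT INTO feature_columns VALUES ('http_requests', 'bot_score')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```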
We have far better ideas and working prototypes in terms of how to prevent this from happening again than to be up here trying to "fix Cloudflare."
Think bigger, y'all.
Isn't that just... wrong? Throwing in an arbitrary LIMIT (vs. maybe having some alert when the table is too long) would just silently truncate the list.
Anybody can be a backseat engineer by throwing out industry best practices like they were gospel, but you have to look at the entire system, not just the database part.
Instead of focusing on the technical reasons why, they should answer how such a change bubbled out to cause such a massive impact.
Why: Proxy fails requests
Why: Handlers crashed because of OOM
Why: Clickhouse returns too much data
Why: A change was introduced causing double the amount of data
Why: A central change was rolled out immediately to all clusters (single point of failure)
Why: There is no standard operating procedure (gate) for releasing changes to the hot path of Cloudflare's network infra.
While the ClickHouse change is important, I personally think it is crucial that Cloudflare tackles the processes, and possibly gates/controls rollouts for hot-path systems, no matter what kind of change is involved; at their scale that should be possible. But that is probably enough co-driving. To me it seems like a process issue more than a technical one.
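A gated rollout of the kind suggested above can be sketched in a few lines. Everything here is hypothetical (node names, batch size, the health check); the point is only that the change stops at the first unhealthy batch instead of reaching every cluster.

```python
# Hypothetical sketch of a gated, staged rollout: apply a change to a small
# batch of nodes, verify health, and halt on the first bad batch.

def rollout(nodes, apply_change, is_healthy, batch_size=2):
    """Apply a change batch by batch; stop and report on the first bad batch."""
    deployed = []
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        for node in batch:
            apply_change(node)
            deployed.append(node)
        if not all(is_healthy(n) for n in batch):
            return deployed, False  # halt: blast radius limited to `deployed`
    return deployed, True

# Simulate a bad change that crashes every node it touches.
nodes = [f"node-{i}" for i in range(10)]
crashed = set()
deployed, ok = rollout(nodes, crashed.add, lambda n: n not in crashed)
print(ok, len(deployed))  # False 2  -- only the first batch was hit
```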
E.g., they should try to work through how their own suggested fix would actually ensure the problem couldn’t happen. I don’t believe it would… lack of nullable fields and normalization typically simplify relational logic, but hardly prevent logical errors. Formal verification can prove your code satisfies a certain formal specification, but doesn’t prove your specification solves your business problem (or makes sense at all, in fact).
Whatever process was stuck in a loop or crashed, whatever service (db, dns, etc.) was unavailable, that outage scenario can be simulated. Changes can have an automated rollback requirement.
My takeaway is that CF has single points of failure they're aware of, and for business reasons, they've decided to not have a redundancy/failover.
> ...and formally verified code, this bug would not have happened.
That's what I mean: "we should have caught the bug", yeah, but that isn't reliability engineering. You assume there will be bugs/outages and prepare for them instead. What happens if the entire DB entered a weird state and was spitting out valid results with incorrect values? What happens if it accepts connections and just stalls?
You prepare for bugs that don't yet exist, you fix bugs that do exist.
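That "treat the database as untrusted" stance can be made concrete with a sanity check at the consumption boundary. The bound and function names here are invented for illustration; the idea is that even a buggy query that doubles the result set gets stopped before it propagates.

```python
# Hypothetical sketch: sanity-check DB output before propagating it,
# instead of assuming the query is correct. MAX_FEATURES is an invented
# known upper bound on the legitimate size of the feature list.

MAX_FEATURES = 200

def load_features(fetch):
    """Fetch a feature list, rejecting obviously-wrong results."""
    features = fetch()
    if not features:
        raise ValueError("empty feature list: refusing to propagate")
    if len(features) > MAX_FEATURES:
        raise ValueError(f"{len(features)} features exceeds bound {MAX_FEATURES}")
    return features

# A buggy query that returns every row twice still gets caught at the gate.
try:
    load_features(lambda: ["f%d" % i for i in range(400)])
    caught = False
except ValueError:
    caught = True
print(caught)  # True
```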
But a BCNF (or 5NF or whatever) database without nullable columns wouldn't have prevented it. Formally verified code might have, but that remains a pipe dream for any significant code base.
The proposed cure is worse than the disease.
Obviously, that's not to say that writing normalized database schemas and formal specifications won't reduce the number of problems you introduce. But people make mistakes anywhere: it could have been in the query even if the DB was in a normal form (and it still could have been, in their case), or in the formal spec as well.
There is no magic bullet for correctness, unfortunately.
"Customers deployed on the new FL2 proxy engine, observed HTTP 5xx errors. Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero."
This simply means the exception-handling quality of your new FL2 is non-existent; it is not on par with FL, nor is its code logic similar.
I hope it was not because of AI driven efficiency gains.
Way different from the umpire's pov
Author fails to mention how to actually formally verify this asynchronous, globally replicated product. He may have solved the delivery theorem, and if that's so I encourage him to share the results.
> No nullable fiels.
Author appears to have not formally verified his post's grammar.
If you take away nullability, you eventually get something like a special state that denotes absence and either:
- Assertions that the absence never happens.
- Untested half-baked code paths that try (and fail) to handle absence.
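The two outcomes above can be sketched in a few lines. This is a hypothetical illustration (the sentinel and function names are made up): banning NULL doesn't remove absence, it just relocates it, and the code either handles the absence state explicitly or asserts it away.

```python
# Hypothetical sketch: "bot score not yet computed" must be represented
# somehow once NULL is banned; here a sentinel stands in for it.

UNSCORED = -1  # sentinel standing in for the banned NULL

def effective_score(score):
    # The honest path: handle absence explicitly.
    if score == UNSCORED:
        return 0  # documented fallback for traffic without a score
    return score

# The other path: code that assumes absence "never happens".
def risky_score(score):
    assert score != UNSCORED, "unscored traffic should never reach here"
    return score

print(effective_score(87), effective_score(UNSCORED))  # 87 0
```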
> formally verified
Yeah, this does prevent most bugs.
But it's horrendously expensive. Probably more expensive than the occasional Cloudflare incident.
> FAANG-style companies are unlikely to adopt formal methods or relational rigor wholesale. But for their most critical systems, they should. It’s the only way to make failures like this impossible by design, rather than just less likely.
That relational rigor imposes what one chooses to be true; it isn't a universal truth.
The frame problem and the qualification problem apply here.
The open domain frame problem == HALT.
When you can fit a problem into the relational model things are nice, but not everything can be reduced to a trivial property.
That is why Codd had to add nulls etc.
You can choose to decide that the queen is rich OR pigs can fly; but a poor queen doesn’t result in flying pigs.
Choice over finite sets == finite indexes over sets == PEM
If you can restrict your problems to where the Entscheidungsproblem is solvable, you can gain many benefits.
But it is horses for courses and sub TC.
The query not utilising a unique constraint/index should have raised a red flag.
Have any of the post-mortems addressed if any of the code that led to CloudFlare's outage was generated by AI?
No, their error was that they shouldn't be querying system tables to perform field discovery; the same method in PostgreSQL (pg_class or whatever it's called) would have had the same result. The simple alternative is to use "describe table <table_name>".
On top of that, they shouldn't be writing ad-hoc code to query system tables; that kind of task belongs in a separate library instead of being mixed with business logic (crappy application design).
Also, this should never have passed code review in the first place, but let's assume it did because errors happen, and this kind of atrocious code and flaky design is not uncommon.
As an example, they could be reading this data from CSV files *and* have made the same mistake. Conflating this with "database design errors" is just stupid - this is not a schema design error, this is a programmer error.
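The "separate library" idea above can be sketched like this. Everything here is hypothetical (the class, the client interface, the table names): business code calls one reviewed introspection helper instead of hand-writing system-table queries, so the schema filter and a duplicate check live in exactly one place.

```python
# Hypothetical sketch of the separation: business code calls a small
# schema-introspection helper instead of querying system tables ad hoc.

class SchemaCatalog:
    """Single, reviewed place that knows how to discover a table's columns."""

    def __init__(self, run_query):
        self._run_query = run_query  # e.g. a DB client's query function

    def columns(self, database, table):
        # The schema filter lives here, once, instead of in every caller.
        rows = self._run_query(
            "SELECT name FROM system.columns "
            "WHERE database = %(db)s AND table = %(table)s",
            {"db": database, "table": table},
        )
        names = [r[0] for r in rows]
        if len(names) != len(set(names)):
            raise RuntimeError(f"duplicate columns for {database}.{table}")
        return names

# Business logic stays query-free (fake client returning canned rows):
fake_rows = [("timestamp",), ("bot_score",)]
catalog = SchemaCatalog(lambda sql, params: fake_rows)
print(catalog.columns("default", "http_requests"))  # ['timestamp', 'bot_score']
```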
> 1. No nullable fiels.
Is that a typo there? fiels should be fields?
What caused it was rolling out a change and moving on to the next recipient without checking if the previous task instantly died.
You can't prevent all crash bugs, but you can check if you are lasering your whole prod.