That may be. What's not specified there is the immense, immense cost of driving a dev org on those terms. It limits, radically, the percent of engineers you can hire (to those who understand this and are willing to work this way), and it slows deployment radically.
Cloudflare may well need to transition to this sort of engineering culture, but there is no doubt that they would not be in the position they are in if they started with this culture -- they would have been too slow to capture the market.
I think critiques that have actionable plans for real dev teams are likely to be more useful than what, to me, reads as a sort of complaint from an ivory tower. Culture matters, shipping speed matters, quality matters, team DNA matters. That's what makes this stuff hard (and interesting!)
I'm completely mystified how the author concludes that the switch from PostgreSQL to ClickHouse shows the root of this problem.
1. If the point is that PostgreSQL is somehow less error-prone, it's not in this case. You can make the same mistake if you leave off the table_schema filter in information_schema.columns queries.
2. If the point is that Cloudflare should have somehow discovered this error through normalization and/or formal methods, perhaps he could demonstrate exactly how this would have (a) worked, (b) been less costly than finding and fixing the query through a better review process or testing, and (c) avoided generating other errors as a side effect.
I'm particularly mystified how lack of normalization is at fault. ClickHouse system.columns is normalized. And if you normalized the query result to remove duplicates that would just result in other kinds of bugs as in 2c above.
Edit: fix typo
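To make the point concrete, here's a minimal sketch. It simulates a metadata catalog (like information_schema.columns or ClickHouse's system.columns) with an in-memory sqlite table; the schema and table names are invented for illustration. The same table name existing under two schemas doubles the result of an unfiltered query, in either database.

```python
import sqlite3

# Hypothetical illustration: simulate a metadata catalog where the same
# table name exists under two schemas (e.g. "default" and a replica "r0").
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE columns (table_schema TEXT, table_name TEXT, column_name TEXT)")
conn.executemany("INSERT INTO columns VALUES (?, ?, ?)", [
    ("default", "http_requests", "timestamp"),
    ("default", "http_requests", "bot_score"),
    ("r0",      "http_requests", "timestamp"),  # same table, other schema
    ("r0",      "http_requests", "bot_score"),
])

# WITHOUT the schema filter: every column appears twice.
no_filter = conn.execute(
    "SELECT column_name FROM columns WHERE table_name = 'http_requests'"
).fetchall()

# WITH the schema filter: the expected column list.
with_filter = conn.execute(
    "SELECT column_name FROM columns "
    "WHERE table_schema = 'default' AND table_name = 'http_requests'"
).fetchall()

print(len(no_filter), len(with_filter))  # 4 2
```

The mistake is in the query, not in which catalog you query it against.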
* The deployment should have followed the blue/green pattern, limiting the blast radius of a bad change to a subset of nodes.
* In general, a company so much at the foundational level of internet connectivity should not follow the "move fast, break things" pattern. They did not have an overwhelming reason to hurry and take risks. This has burned a lot of trust, no matter the nature of the actual bug.
You don't know the context, you don't know _anything_ except for what Cloudflare chooses to share.
There are very few companies who deal with the kind of load that Cloudflare does. I dread to think what weird edge cases they've run into because of their sheer scale.
In a database, you wouldn't solve this with a DISTINCT or a LIMIT. You would make the schema guarantee uniqueness.
And yes, that wouldn't deal with cross-database queries. But the solution there is just to filter by db name; the rest is table design.
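A minimal sketch of "make the schema guarantee uniqueness", again using sqlite as a stand-in (the table and column names are made up): with a UNIQUE constraint, the duplicate insert fails loudly instead of a missing DISTINCT silently doubling the result set.

```python
import sqlite3

# Hypothetical sketch: let the schema, not each query, guarantee uniqueness.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE feature_columns (
        table_name  TEXT NOT NULL,
        column_name TEXT NOT NULL,
        UNIQUE (table_name, column_name)
    )
""")
conn.execute("INSERT INTO feature_columns VALUES ('http_requests', 'bot_score')")
try:
    # The second, identical row is rejected at write time.
    conn.execute("INSERT INTO feature_columns VALUES ('http_requests', 'bot_score')")
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(duplicate_rejected)  # True
```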
We have far better ideas and working prototypes in terms of how to prevent this from happening again than to be up here trying to "fix Cloudflare."
Think bigger, y'all.
Isn't that just... wrong? Throwing in an arbitrary LIMIT (vs. maybe having some alert when the table is too long) would just silently truncate the list.
Anybody can be a backseat engineer by throwing out industry best practices like they were gospel, but you have to look at the entire system, not just the database part.
Instead of focusing on the technical reasons why, they should answer how such a change bubbled out to cause such a massive impact.
Why: Proxy fails requests
Why: Handlers crashed because of OOM
Why: Clickhouse returns too much data
Why: A change was introduced causing double the amount of data
Why: A central change was rolled out immediately to all clusters (single point of failure)
Why: There is no standard operating procedure (gate) for releasing changes to the hot path of Cloudflare's network infra.
While the ClickHouse change is important, I personally think it is crucial that Cloudflare tackles the processes, and possibly gates/controls rollouts for hot-path systems, no matter what kind of change is involved; at their scale that should be possible. But that is probably enough co-driving. To me it seems like a process issue more than a technical one.
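A gated rollout of the kind suggested above can be sketched in a few lines. Everything here is hypothetical (node names, batch size, the health check); the point is only that the change stops at the first unhealthy batch instead of reaching every cluster.

```python
# Hypothetical sketch of a gated, staged rollout: apply a change to a small
# batch of nodes, verify health, and halt on the first bad batch.

def rollout(nodes, apply_change, is_healthy, batch_size=2):
    """Apply a change batch by batch; stop and report on the first bad batch."""
    deployed = []
    for i in range(0, len(nodes), batch_size):
        batch = nodes[i:i + batch_size]
        for node in batch:
            apply_change(node)
            deployed.append(node)
        if not all(is_healthy(n) for n in batch):
            return deployed, False  # halt: blast radius limited to `deployed`
    return deployed, True

# Simulate a bad change that crashes every node it touches.
nodes = [f"node-{i}" for i in range(10)]
crashed = set()
deployed, ok = rollout(nodes, crashed.add, lambda n: n not in crashed)
print(ok, len(deployed))  # False 2  -- only the first batch was hit
```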
E.g., they should try to work through how their own suggested fix would actually ensure the problem couldn’t happen. I don’t believe it would… lack of nullable fields and normalization typically simplify relational logic, but hardly prevent logical errors. Formal verification can prove your code satisfies a certain formal specification, but doesn’t prove your specification solves your business problem (or makes sense at all, in fact).
Whatever process was stuck in a loop or crashed, whatever service (db, dns, etc.) was unavailable, that outage scenario can be simulated. Changes can have an automated rollback requirement.
My takeaway is that CF has single points of failure they're aware of, and for business reasons, they've decided to not have a redundancy/failover.
> ...and formally verified code, this bug would not have happened.
That's what I mean: "we should have caught the bug", yeah, but that isn't reliability engineering. You assume there will be bugs/outages and prepare for them instead. What happens if the entire DB entered a weird state and was spitting out valid results with incorrect values? What happens if it accepts connections and just stalls?
You prepare for bugs that don't yet exist, you fix bugs that do exist.
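That "treat the database as untrusted" stance can be made concrete with a sanity check at the consumption boundary. The bound and function names here are invented for illustration; the idea is that even a buggy query that doubles the result set gets stopped before it propagates.

```python
# Hypothetical sketch: sanity-check DB output before propagating it,
# instead of assuming the query is correct. MAX_FEATURES is an invented
# known upper bound on the legitimate size of the feature list.

MAX_FEATURES = 200

def load_features(fetch):
    """Fetch a feature list, rejecting obviously-wrong results."""
    features = fetch()
    if not features:
        raise ValueError("empty feature list: refusing to propagate")
    if len(features) > MAX_FEATURES:
        raise ValueError(f"{len(features)} features exceeds bound {MAX_FEATURES}")
    return features

# A buggy query that returns every row twice still gets caught at the gate.
try:
    load_features(lambda: ["f%d" % i for i in range(400)])
    caught = False
except ValueError:
    caught = True
print(caught)  # True
```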
But a BCNF (or 5NF or whatever) database without nullable columns wouldn't have prevented it. Formally verified code might have, but that remains a pipe dream for any significant code base.
The proposed cure is worse than the disease.
Obviously, that's not to say that writing normalized database schemas and formal specifications won't reduce the number of problems you introduce. But people make mistakes anywhere: it could have been in the query even if the DB was in a normal form (and it still could have been, in their case), or in the formal spec as well.
There is no magic bullet for correctness, unfortunately.
"Customers deployed on the new FL2 proxy engine, observed HTTP 5xx errors. Customers on our old proxy engine, known as FL, did not see errors, but bot scores were not generated correctly, resulting in all traffic receiving a bot score of zero."
This simply means the exception-handling quality of your new FL2 is non-existent; it is not on par with FL, nor is its code logic similar.
I hope it was not because of AI driven efficiency gains.
Way different from the umpire's pov
Author fails to mention how to actually formally verify this asynchronous, globally replicated product. He may have solved the delivery theorem, and if that's so I encourage him to share the results.
> No nullable fiels.
Author appears to have not formally verified his post's grammar.
If you take away nullability, you eventually get something like a special state that denotes absence and either:
- Assertions that the absence never happens.
- Untested half-baked code paths that try (and fail) to handle absence.
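The two outcomes above can be sketched in a few lines. This is a hypothetical illustration (the sentinel and function names are made up): banning NULL doesn't remove absence, it just relocates it, and the code either handles the absence state explicitly or asserts it away.

```python
# Hypothetical sketch: "bot score not yet computed" must be represented
# somehow once NULL is banned; here a sentinel stands in for it.

UNSCORED = -1  # sentinel standing in for the banned NULL

def effective_score(score):
    # The honest path: handle absence explicitly.
    if score == UNSCORED:
        return 0  # documented fallback for traffic without a score
    return score

# The other path: code that assumes absence "never happens".
def risky_score(score):
    assert score != UNSCORED, "unscored traffic should never reach here"
    return score

print(effective_score(87), effective_score(UNSCORED))  # 87 0
```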
> formally verified
Yeah, this does prevent most bugs.
But it's horrendously expensive. Probably more expensive than the occasional Cloudflare incident.
> FAANG-style companies are unlikely to adopt formal methods or relational rigor wholesale. But for their most critical systems, they should. It’s the only way to make failures like this impossible by design, rather than just less likely.
That relational rigor imposes what one chooses to be true; it isn't a universal truth.
The frame problem and the qualification problem apply here.
The open domain frame problem == HALT.
When you can fit a problem into the relational model things are nice, but not everything can be reduced to a trivial property.
That is why Codd had to add nulls etc.
You can choose to decide that the queen is rich OR pigs can fly; but a poor queen doesn’t result in flying pigs.
Choice over finite sets == finite indexes over sets == PEM
If you can restrict your problems to where the Entscheidungsproblem is solvable, you can gain many benefits.
But it is horses for courses and sub TC.
The query not utilising a unique constraint/index should have raised a red flag.
Have any of the post-mortems addressed if any of the code that led to CloudFlare's outage was generated by AI?
No, their error was that they shouldn't be querying system tables to perform field discovery; the same method in PostgreSQL (pg_class or whatever it's called) would have had the same result. The simple alternative is to use "describe table <table_name>".
On top of that, they shouldn't be writing ad-hoc code to query system tables; that kind of task belongs in a separate library instead of being mixed with business logic (crappy application design).
Also, this should never have passed code review in the first place, but let's assume it did because errors happen, and this kind of atrocious code and flaky design is not uncommon.
As an example, they could be reading this data from CSV files *and* have made the same mistake. Conflating this with "database design errors" is just stupid - this is not a schema design error, this is a programmer error.
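The "separate library" idea above can be sketched like this. Everything here is hypothetical (the class, the client interface, the table names): business code calls one reviewed introspection helper instead of hand-writing system-table queries, so the schema filter and a duplicate check live in exactly one place.

```python
# Hypothetical sketch of the separation: business code calls a small
# schema-introspection helper instead of querying system tables ad hoc.

class SchemaCatalog:
    """Single, reviewed place that knows how to discover a table's columns."""

    def __init__(self, run_query):
        self._run_query = run_query  # e.g. a DB client's query function

    def columns(self, database, table):
        # The schema filter lives here, once, instead of in every caller.
        rows = self._run_query(
            "SELECT name FROM system.columns "
            "WHERE database = %(db)s AND table = %(table)s",
            {"db": database, "table": table},
        )
        names = [r[0] for r in rows]
        if len(names) != len(set(names)):
            raise RuntimeError(f"duplicate columns for {database}.{table}")
        return names

# Business logic stays query-free (fake client returning canned rows):
fake_rows = [("timestamp",), ("bot_score",)]
catalog = SchemaCatalog(lambda sql, params: fake_rows)
print(catalog.columns("default", "http_requests"))  # ['timestamp', 'bot_score']
```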
> 1. No nullable fiels.
Is that a typo there? fiels should be fields?
What caused it was rolling out a change and moving on to the next recipient without checking if the previous task instantly died.
You can't prevent all crash bugs, but you can check if you are lasering your whole prod.