When talking of their earlier Lua code:
> we have never before applied a killswitch to a rule with an action of “execute”.
I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?
It tracks with what I've seen elsewhere: quality engineering can't keep up with production engineering. It's just that I think of CloudFlare as an infrastructure place, where that shouldn't be true.
I had a manager who came from defense electronics in the 1980s. He said that in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.
I don't think this is really helping the site owners. I suspect it's mainly about AI extortion.
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:
They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.
> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules
Warning signs like this are how you know that something might be wrong!
Only after that do you use gradual deployment, with a big red oopsie button which immediately rolls the changes back. Languages with strong type systems won't save you; good procedure will.
Some people go even further, speculating that the original military DARPA network, the precursor to the modern Internet, was designed to ensure continuity of command and control (C&C) for the US military in the event of an all-out nuclear attack during the Cold War.
This is the time for Internet researchers to redefine how Internet applications are built and operated. The local-first paradigm is a first step in the right direction (pardon the pun) [2].
[1] The Real Internet Architecture: Past, Present, and Future Evolution:
https://press.princeton.edu/books/paperback/9780691255804/th...
[2] Local-first software: You own your data, in spite of the cloud
The good news is that a more decentralized internet, with components scoped to the human brain, is better for innovation, progress, and freedom anyway.
I have seen similar bugs in the Cloudflare API recently as well.
There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.
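For what it's worth, the fix is usually just ordering: do the entitlement check before any feature-specific work. A minimal sketch, with hypothetical handler and plan names (not Cloudflare's actual API code):

```rust
// Hypothetical handler sketch: reject non-enterprise users before doing any
// feature-specific work, instead of checking the plan at the last step.
#[allow(dead_code)]
#[derive(PartialEq)]
enum Plan {
    Free,
    Pro,
    Enterprise,
}

struct User {
    plan: Plan,
}

fn enterprise_only_endpoint(user: &User, payload: &str) -> Result<String, &'static str> {
    // Entitlement check first: cheap, and it prevents partial side effects for
    // users who were never allowed to call this endpoint at all.
    if user.plan != Plan::Enterprise {
        return Err("403: this feature requires an Enterprise plan");
    }
    // ... the actual feature logic would go here ...
    Ok(format!("processed {} bytes", payload.len()))
}

fn main() {
    let user = User { plan: Plan::Pro };
    assert!(enterprise_only_endpoint(&user, "{}").is_err());
}
```

Doing the plan check last means the endpoint may already have validated, logged, or partially acted on a request from a user who should have been rejected outright.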
They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?
Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.
Pure speculation, but to me it sounds like there's more to the story. This sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.
I've worked at one of the top fintech firms. Whenever we did a config change or deployment, we were supposed to have a rollback plan ready and monitor key dashboards for 15-30 minutes.
The dashboards, covering the systems and key business metrics affected by the deployment, had to be prepared beforehand and reviewed by teammates.
I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.
For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
After some investigation, I realized that none of these routes were getting through the Cloudflare OWASP ruleset. The reported anomaly score totals 50, exceeding the pre-configured maximum of 40 (Medium).
These are simple image or video uploads, yet the WAF is flagging anomalies that make no sense, such as the following:
- 933100: PHP Injection Attack: PHP Open Tag Found (Cloudflare OWASP Core Ruleset Score +5)
- 933180: PHP Injection Attack: Variable Function Call Found (Cloudflare OWASP Core Ruleset Score +5)
For now, I’ve had to raise the OWASP Anomaly Score Threshold to 60 and enable the JS Challenge, but I believe something is wrong with the WAF after today’s outage.
As of this moment, the issue still has not been resolved.
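For context, OWASP CRS-style scoring is additive: each matched rule contributes its severity score, and the request is blocked once the total reaches the configured threshold. A minimal sketch of that logic (illustrative only, not Cloudflare's actual implementation):

```rust
// Illustrative OWASP CRS-style anomaly scoring, not Cloudflare's implementation.
#[allow(dead_code)]
struct RuleHit {
    rule_id: u32,
    score: u32, // e.g. a "critical" severity rule contributes +5
}

fn should_block(hits: &[RuleHit], threshold: u32) -> bool {
    let total: u32 = hits.iter().map(|h| h.score).sum();
    total >= threshold
}

fn main() {
    // Ten rule hits at +5 each, as in the report above, total 50: that crosses
    // a "Medium" threshold of 40 and the upload gets blocked.
    let hits: Vec<RuleHit> = (0u32..10)
        .map(|i| RuleHit { rule_id: 933100 + i, score: 5 })
        .collect();
    assert!(should_block(&hits, 40));
    // Raising the threshold to 60 lets the same request through.
    assert!(!should_block(&hits, 60));
}
```

Which is why raising the threshold to 60 works around the problem without explaining why benign upload bodies are matching PHP-injection rules in the first place.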
But a more important takeaway:
> This type of code error is prevented by languages with strong type systems
After rolling out a bad ruleset update, they applied a killswitch (rolled out immediately to 100%), which hit a code path that had never been executed before:
> However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset
> a straightforward error in the code, which had existed undetected for many years
Every change is a deployment, even if it's config. Treat it as such.
Also, you should know that a strongly typed language won't save you from every type of problem, especially not if you allow things like unwrap().
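To make that concrete, here is a minimal Rust sketch (hypothetical types and names, not Cloudflare's actual code) where everything type-checks and the never-exercised branch still panics at runtime:

```rust
// Hypothetical sketch, not Cloudflare's code: a killswitched "execute" rule
// produces no result, and the result-processing step unwraps anyway.
#[allow(dead_code)]
#[derive(Debug)]
enum Action {
    Block,
    Log,
    Skip,
    Execute, // runs a sub-ruleset and is expected to produce a result
}

struct Rule {
    action: Action,
    killswitched: bool,
}

fn evaluate(rule: &Rule) -> Option<&'static str> {
    match (&rule.action, rule.killswitched) {
        (Action::Execute, true) => None, // killswitch: skip the sub-ruleset entirely
        (Action::Execute, false) => Some("sub-ruleset result"),
        _ => Some("terminal action result"),
    }
}

fn process_results(rule: &Rule) -> &'static str {
    // This compiles without complaint, and panics the first time a
    // killswitched "execute" rule actually shows up in production.
    evaluate(rule).unwrap()
}

fn main() {
    let rule = Rule {
        action: Action::Execute,
        killswitched: true,
    };
    let _ = process_results(&rule); // panics at runtime
}
```

The compiler is perfectly happy with the unwrap(); only a test (or production traffic) that actually takes the killswitched-execute path reveals the panic.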
It is just mind-boggling that they very obviously have completely untested code which proxies requests for all their customers. If you don't want to write the tests, then at least fuzz it.
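A bare-bones fuzz target is only a few lines with cargo-fuzz; parse_ruleset() below is a self-contained placeholder standing in for the real rules-module entry point:

```rust
// Sketch of a cargo-fuzz target (e.g. fuzz/fuzz_targets/ruleset.rs).
#![no_main]
use libfuzzer_sys::fuzz_target;

fn parse_ruleset(data: &[u8]) -> Result<usize, ()> {
    // Placeholder parser so the sketch compiles on its own; a real target
    // would call into the actual rules module instead.
    std::str::from_utf8(data)
        .map(|s| s.lines().count())
        .map_err(|_| ())
}

fuzz_target!(|data: &[u8]| {
    // The only assertion is "don't panic": any unwrap()/indexing bug on odd
    // rulesets or killswitch combinations surfaces as a crash artifact.
    let _ = parse_ruleset(data);
});
```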
I have mixed feelings about this.
On the one hand, I absolutely don't want a CDN to look inside my payloads and decide what's good for me and what isn't. Today it's protection, tomorrow it's censorship.
On the other hand, this is exactly what CloudFlare is good for: protecting sites from malicious requests.
True, as long as you don't call unwrap!
Yes, this is the second time in a month. Were folks expecting that to have been enough time for them to have made sweeping technical and organization changes? I say no—this doesn't mean they aren't trying or haven't learned any lessons from the last outage. It's a bit too soon to say that.
I see this event primarily as another example of the #1 class of major outages: bad rapid global configuration change. (The last CloudFlare outage was too, but I'm not just talking about CloudFlare. Google has had many, many such outages. There was an inexplicable multi-year gap between recognizing this and having a good, widely available staged config rollout system for teams to drop into their systems.) Stuff like DoS attack configurations needs to roll out globally quickly. But they really need to make it not quite this quick. Imagine they deployed to one server for one minute, then one region for one minute on success, then everywhere on success. Then this would have been a tiny blip rather than a huge deal.
(It can be a bit hard to define "success" when you're doing something like blocking bad requests that may even be a majority of traffic during a DDoS attack, but noticing 100% 5xx errors for 38% of your users due to a parsing bug is doable!)
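A minimal sketch of that wave-by-wave idea, with hypothetical waves, soak time, and error-rate check (nothing here is Cloudflare's actual deployment system):

```rust
// Illustrative wave-by-wave rollout loop with a hypothetical health check;
// this is not a real deployment system, just the shape of the idea.
use std::{thread, time::Duration};

enum Wave {
    OneServer,
    OneRegion,
    Global,
}

fn error_rate_after(_wave: &Wave) -> f64 {
    // Stand-in for real telemetry: fraction of 5xx responses seen in this wave.
    0.001
}

fn rollout(waves: &[Wave], max_error_rate: f64) -> Result<(), &'static str> {
    for wave in waves {
        // push the config to this wave (elided), then soak for a minute
        thread::sleep(Duration::from_secs(60));
        if error_rate_after(wave) > max_error_rate {
            // roll back everywhere and stop: the blip stays small
            return Err("error budget exceeded, rolled back");
        }
    }
    Ok(())
}

fn main() {
    let waves = [Wave::OneServer, Wave::OneRegion, Wave::Global];
    match rollout(&waves, 0.01) {
        Ok(()) => println!("config fully deployed"),
        Err(e) => eprintln!("{e}"),
    }
}
```

The point is only that each wave gets a chance to fail small before the next one starts.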
As for the specific bug: meh. They should have had 100% branch coverage on something as critical (and likely small) as the parsing for this config. Arguably a statically typed language would have helped (but the `.unwrap()` error in the previous outage is a bit of a counterargument to that). But it just wouldn't have mattered that much if they caught it before global rollout.
That being said, I think it’s worth a discussion. How much of the last 3 outages was because of JGC (the former CTO) retiring and Dane taking over?
Did JGC have a steady hand that’s missing? Or was it just time for outages that would have happened anyway?
Dane has maintained a culture of transparency, which is fantastic, but did something get injected into the culture that is leading to these issues? Will things become more or less stable now that JGC has left?
Curious for anyone with some insight or opinions.
(Also, if it wasn’t clear - huge Cloudflare fan and sending lots of good vibes to the team)
So they are aware of some basic mitigation tactics guarding against errors:
> This system does not perform gradual rollouts,
They just choose to YOLO it.
> Typical actions are “block”, “log”, or “skip”. Another type of action is “execute”,
> However, we have never before applied a killswitch to a rule with an action of “execute”.
Do they do no testing? This wouldn't even require fuzzing with “infinite” variations; it's a limited list of actions.
> existed undetected for many years. This type of code error is prevented by languages with strong type systems.
So this solution is also well known, just ignored for years, because "if it’s not broken, don’t fix it", right?
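For illustration, this is roughly what the "strong type system" argument buys you in Rust (hypothetical types, not the actual rules module): if result processing has to match exhaustively on the possible outcomes, forgetting the killswitched-execute case is a compile error instead of a years-old latent bug.

```rust
// Hypothetical sketch of exhaustive matching over rule outcomes; not the real module.
#[allow(dead_code)]
enum Outcome {
    Blocked,
    Logged,
    Skipped,
    SubRuleset(Vec<Outcome>), // produced by an "execute" action
    KillswitchedExecute,      // execute rule disabled: no sub-ruleset was evaluated
}

fn summarize(outcome: &Outcome) -> &'static str {
    // Remove any arm here (say, KillswitchedExecute) and this fails to compile,
    // instead of surfacing years later as a runtime exception.
    match outcome {
        Outcome::Blocked => "blocked",
        Outcome::Logged => "logged",
        Outcome::Skipped => "skipped",
        Outcome::SubRuleset(_) => "evaluated sub-ruleset",
        Outcome::KillswitchedExecute => "execute rule killswitched, nothing evaluated",
    }
}

fn main() {
    println!("{}", summarize(&Outcome::KillswitchedExecute));
}
```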
Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?
It still surprises me that there are basically no free alternatives comparable to Cloudflare. Putting everything on CF creates a pretty serious single point of failure.
It's strange that in most industries you have at least two major players, like Coke vs. Pepsi or Nike vs. Adidas. But in the CDN/edge space, there doesn't seem to be a real free competitor that matches Cloudflare's feature set.
It feels very unhealthy for the ecosystem. Does anyone know why this is the case?
I don't get it: their main product is DDoS protection, yet Cloudflare itself keeps going down for some reason.
This company makes zero sense to me.
https://blog.cloudflare.com/deep-dive-into-cloudflares-sept-...
- prioritize security: get patches ASAP
- prioritize availability: get patches after a cooldown period
Because ultimately, it's a tradeoff that cannot be handled by Cloudflare. It depends on your business, your threat model.
Benefit: Earliest uptake of new features and security patches.
Drawback: Higher risk of outages.
I think this should be possible since they already differentiate between Free, Pro, and Enterprise accounts. I do not know how the routing for that works, but I bet they could do this. Think crowd-sourced beta testers. It would also be a perk for anyone whose PCI audit or FedRAMP requirements prioritize security over uptime.
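A sketch of what that opt-in might look like, purely hypothetical (Cloudflare exposes no such setting; names and wave numbers are made up):

```rust
// Purely hypothetical: this just makes the opt-in idea above concrete.
enum RolloutPreference {
    Security,     // earliest wave: new rules and patches ASAP ("crowd-sourced beta")
    Availability, // later wave: only after earlier waves have soaked cleanly
}

/// Which deployment wave a zone joins, given the preference its owner opted into.
fn rollout_wave(pref: &RolloutPreference) -> u8 {
    match pref {
        RolloutPreference::Security => 0,
        RolloutPreference::Availability => 2,
    }
}

fn main() {
    // A PCI/FedRAMP-driven zone that values patches over uptime opts in early;
    // a zone that values uptime waits for the later wave.
    assert_eq!(rollout_wave(&RolloutPreference::Security), 0);
    assert_eq!(rollout_wave(&RolloutPreference::Availability), 2);
}
```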
At some point they'll have to admit this React thing ain't working and just use classic server-rendered pages, since their dashboards are simple toggle controls.
If someone messes up royally, is there someone who says "if you break the build/whatever super critical, then your ass is the grass and I'm the lawn mower"?
I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.
HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.
At the same time, I can't think of a company more transparent than CloudFlare when it comes to these kinds of things. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability, and this is the result.
Say what you want, but I'd prefer to trust CloudFlare, who admits and acts upon their fuckups, rather than someone who tries to cover them up or downplay them like some other major cloud providers do.
@eastdakota: ignore the negative comments here; transparency is a very good strategy, and this article shows a good plan to avoid further problems.
> we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.
Why is the Next.js limit 1 MB? It's not enough for uploading user-generated content (photographs, scanned invoices), while a 1 MB request body is ridiculous for even multiple JSON API calls. These frameworks need to at least provide some pushback against unoptimized development, even if it's just a lower default request body limit. Otherwise all web applications will become as slow as the MS Office suite or Reddit.
ALSO, it is very, very weird that they had not caught this seemingly obvious bug in the proxy buffer size handling. It suggests that change nr 2, done in "reactive" mode after change nr 1 broke shit, HAD NOT BEEN TESTED AT ALL! Which is the core reason they should never have deployed it; they should instead have reverted to a known good state, then tested BOTH changes combined.
"Why?"
"I've just been transferred to the Cloudflare outage explanation department."
Cloudflare deployed code that was literally never tested, not even once, neither manually nor by unit test; otherwise the straightforward error would have been detected immediately. And their implied solution seems to be not testing their code when it is written, nor even adding 100% code coverage after the fact, but rather relying on a programming language to bail them out and cover up their failure to test.
Doesn't Cloudflare rigorously test their changes before deployment to make sure things like this don't happen again? This better not be covering for the fact that they are using AI to fix issues like this one.
There had better not be any vibe coders or AI agents touching such critical pieces of infrastructure at all, and I expected Cloudflare to learn from the previous outage very quickly.
But this is becoming quite a pattern, and we might need to start putting their unreliability next to GitHub's (which goes down every week).
Come on.
This post-mortem raises more questions than it answers, such as why exactly China would have been immune.
https://blog.cloudflare.com/5-december-2025-outage/#what-abo...
Interesting.