When talking of their earlier Lua code:
> we have never before applied a killswitch to a rule with an action of “execute”.
I was surprised that a rules-based system was not tested completely, perhaps because the Lua code is legacy relative to the newer Rust implementation?
It tracks with what I've seen elsewhere: quality engineering can't keep up with production engineering. It's just that I think of CloudFlare as an infrastructure place, where that shouldn't be true.
I had a manager who came from defense electronics in the 1980s. He said that in that context, the quality engineering team was always in charge, and always more skilled. For him, software is backwards.
I don't think this is really helping the site owners. I suspect it's mainly about AI extortion.
> As soon as the change propagated to our network, code execution in our FL1 proxy reached a bug in our rules module which led to the following LUA exception:
They really need to figure out a way to correlate global configuration changes to the errors they trigger as fast as possible.
> as part of this rollout, we identified an increase in errors in one of our internal tools which we use to test and improve new WAF rules
Warning signs like this are how you know that something might be wrong!
Only after that do you use gradual deployment, with a big red oopsie button which immediately rolls the changes back. Languages with strong type systems won't save you; good procedure will.
Some people go even further, speculating that the original military DARPA network, the precursor to the modern Internet, was designed to ensure continuity of command and control (C&C) for the US military in the event of an all-out nuclear attack during the Cold War.
This is the time for Internet researchers to redefine how Internet applications are built and operated. The local-first paradigm is a first step in the right direction (pardon the pun) [2].
[1] The Real Internet Architecture: Past, Present, and Future Evolution:
https://press.princeton.edu/books/paperback/9780691255804/th...
[2] Local-first software: You own your data, in spite of the cloud
The good news is that a more decentralized internet, with components scoped to the human brain, is better for innovation, progress, and freedom anyway.
I have seen similar bugs in the Cloudflare API recently as well.
There is an endpoint for a feature that is available only to enterprise users, but the check for whether the user is on an enterprise plan is done at the last step.
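For what it's worth, the fix is usually just ordering: do the entitlement check before any feature-specific work. A minimal sketch, with hypothetical handler and plan names (not Cloudflare's actual API code):

```rust
// Hypothetical handler sketch: reject non-enterprise users before doing any
// feature-specific work, instead of checking the plan at the last step.
#[allow(dead_code)]
#[derive(PartialEq)]
enum Plan {
    Free,
    Pro,
    Enterprise,
}

struct User {
    plan: Plan,
}

fn enterprise_only_endpoint(user: &User, payload: &str) -> Result<String, &'static str> {
    // Entitlement check first: cheap, and it prevents partial side effects for
    // users who were never allowed to call this endpoint at all.
    if user.plan != Plan::Enterprise {
        return Err("403: this feature requires an Enterprise plan");
    }
    // ... the actual feature logic would go here ...
    Ok(format!("processed {} bytes", payload.len()))
}

fn main() {
    let user = User { plan: Plan::Pro };
    assert!(enterprise_only_endpoint(&user, "{}").is_err());
}
```

Doing the plan check last means the endpoint may already have validated, logged, or partially acted on a request from a user who should have been rejected outright.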
They saw errors related to a deployment, and because it was related to a security issue, instead of rolling it back they decided to make another deployment with global blast radius?
Not only did they fail to apply the deployment safety 101 lesson of "when in doubt, roll back" but they also failed to assess the risk related to the same deployment system that caused their 11/18 outage.
Pure speculation, but to me it sounds like there's more to the story. This sounds like the sort of cowboy decision a team makes when they've either already broken all the rules or weren't following them in the first place.
I've worked at one of the top fintech firms. Whenever we did a config change or deployment, we were supposed to have a rollback plan ready and monitor key dashboards for 15-30 minutes.
The dashboards, covering the systems and key business metrics affected by the deployment, had to be prepared beforehand and reviewed by teammates.
I've never seen a downtime longer than 1 minute while I was there, because you get a spike on the dashboard immediately when something goes wrong.
For the entire system to be down for 10+ minutes due to a bad config change or deployment is just beyond me.
After some investigation, I realized that none of these routes were getting through the Cloudflare OWASP ruleset. The reported anomaly score totals 50, exceeding the pre-configured maximum of 40 (Medium).
These are simple image or video uploads, yet the WAF is flagging anomalies that make no sense, such as the following:
- 933100: PHP Injection Attack: PHP Open Tag Found (Cloudflare OWASP Core Ruleset Score +5)
- 933180: PHP Injection Attack: Variable Function Call Found (Cloudflare OWASP Core Ruleset Score +5)
For now, I’ve had to raise the OWASP Anomaly Score Threshold to 60 and enable the JS Challenge, but I believe something is wrong with the WAF after today’s outage.
As of this moment, the issue still has not been resolved.
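For context, OWASP CRS-style scoring is additive: each matched rule contributes its severity score, and the request is blocked once the total reaches the configured threshold. A minimal sketch of that logic (illustrative only, not Cloudflare's actual implementation):

```rust
// Illustrative OWASP CRS-style anomaly scoring, not Cloudflare's implementation.
#[allow(dead_code)]
struct RuleHit {
    rule_id: u32,
    score: u32, // e.g. a "critical" severity rule contributes +5
}

fn should_block(hits: &[RuleHit], threshold: u32) -> bool {
    let total: u32 = hits.iter().map(|h| h.score).sum();
    total >= threshold
}

fn main() {
    // Ten rule hits at +5 each, as in the report above, total 50: that crosses
    // a "Medium" threshold of 40 and the upload gets blocked.
    let hits: Vec<RuleHit> = (0u32..10)
        .map(|i| RuleHit { rule_id: 933100 + i, score: 5 })
        .collect();
    assert!(should_block(&hits, 40));
    // Raising the threshold to 60 lets the same request through.
    assert!(!should_block(&hits, 60));
}
```

Which is why raising the threshold to 60 works around the problem without explaining why benign upload bodies are matching PHP-injection rules in the first place.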
But a more important takeaway:
> This type of code error is prevented by languages with strong type systems
After rolling out a bad ruleset update, they applied a killswitch (rolled out immediately to 100%), which hit a code path that had never been executed before:
> However, we have never before applied a killswitch to a rule with an action of “execute”. When the killswitch was applied, the code correctly skipped the evaluation of the execute action, and didn’t evaluate the sub-ruleset pointed to by it. However, an error was then encountered while processing the overall results of evaluating the ruleset
> a straightforward error in the code, which had existed undetected for many years
Every change is a deployment, even if it's config. Treat it as such.
Also, you should know that a strongly typed language won't save you from every type of problem, especially not if you allow things like unwrap().
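To make that concrete, here is a minimal Rust sketch (hypothetical types and names, not Cloudflare's actual code) where everything type-checks and the never-exercised branch still panics at runtime:

```rust
// Hypothetical sketch, not Cloudflare's code: a killswitched "execute" rule
// produces no result, and the result-processing step unwraps anyway.
#[allow(dead_code)]
#[derive(Debug)]
enum Action {
    Block,
    Log,
    Skip,
    Execute, // runs a sub-ruleset and is expected to produce a result
}

struct Rule {
    action: Action,
    killswitched: bool,
}

fn evaluate(rule: &Rule) -> Option<&'static str> {
    match (&rule.action, rule.killswitched) {
        (Action::Execute, true) => None, // killswitch: skip the sub-ruleset entirely
        (Action::Execute, false) => Some("sub-ruleset result"),
        _ => Some("terminal action result"),
    }
}

fn process_results(rule: &Rule) -> &'static str {
    // This compiles without complaint, and panics the first time a
    // killswitched "execute" rule actually shows up in production.
    evaluate(rule).unwrap()
}

fn main() {
    let rule = Rule {
        action: Action::Execute,
        killswitched: true,
    };
    let _ = process_results(&rule); // panics at runtime
}
```

The compiler is perfectly happy with the unwrap(); only a test (or production traffic) that actually takes the killswitched-execute path reveals the panic.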
It is just mind-boggling that they very obviously have completely untested code which proxies requests for all their customers. If you don't want to write the tests, then at least fuzz it.
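A bare-bones fuzz target is only a few lines with cargo-fuzz; parse_ruleset() below is a self-contained placeholder standing in for the real rules-module entry point:

```rust
// Sketch of a cargo-fuzz target (e.g. fuzz/fuzz_targets/ruleset.rs).
#![no_main]
use libfuzzer_sys::fuzz_target;

fn parse_ruleset(data: &[u8]) -> Result<usize, ()> {
    // Placeholder parser so the sketch compiles on its own; a real target
    // would call into the actual rules module instead.
    std::str::from_utf8(data)
        .map(|s| s.lines().count())
        .map_err(|_| ())
}

fuzz_target!(|data: &[u8]| {
    // The only assertion is "don't panic": any unwrap()/indexing bug on odd
    // rulesets or killswitch combinations surfaces as a crash artifact.
    let _ = parse_ruleset(data);
});
```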
I have mixed feelings about this.
On the one hand, I absolutely don't want a CDN to look inside my payloads and decide what's good for me and what isn't. Today it's protection, tomorrow it's censorship.
On the other hand, this is exactly what CloudFlare is good for: protecting sites from malicious requests.
True, as long as you don't call unwrap!
Yes, this is the second time in a month. Were folks expecting that to have been enough time for them to have made sweeping technical and organization changes? I say no—this doesn't mean they aren't trying or haven't learned any lessons from the last outage. It's a bit too soon to say that.
I see this event primarily as another example of the #1 class of major outages: bad rapid global configuration change. (The last CloudFlare outage was too, but I'm not just talking about CloudFlare. Google has had many, many such outages. There was an inexplicable multi-year gap between recognizing this and having a good, widely available staged config rollout system for teams to drop into their systems.) Stuff like DoS attack configurations needs to roll out globally quickly. But they really need to make it not quite this quick. Imagine they deployed to one server for one minute, then one region for one minute on success, then everywhere on success. Then this would have been a tiny blip rather than a huge deal.
(It can be a bit hard to define "success" when you're doing something like blocking bad requests that may even be a majority of traffic during a DDoS attack, but noticing 100% 5xx errors for 38% of your users due to a parsing bug is doable!)
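A minimal sketch of that wave-by-wave idea, with hypothetical waves, soak time, and error-rate check (nothing here is Cloudflare's actual deployment system):

```rust
// Illustrative wave-by-wave rollout loop with a hypothetical health check;
// this is not a real deployment system, just the shape of the idea.
use std::{thread, time::Duration};

enum Wave {
    OneServer,
    OneRegion,
    Global,
}

fn error_rate_after(_wave: &Wave) -> f64 {
    // Stand-in for real telemetry: fraction of 5xx responses seen in this wave.
    0.001
}

fn rollout(waves: &[Wave], max_error_rate: f64) -> Result<(), &'static str> {
    for wave in waves {
        // push the config to this wave (elided), then soak for a minute
        thread::sleep(Duration::from_secs(60));
        if error_rate_after(wave) > max_error_rate {
            // roll back everywhere and stop: the blip stays small
            return Err("error budget exceeded, rolled back");
        }
    }
    Ok(())
}

fn main() {
    let waves = [Wave::OneServer, Wave::OneRegion, Wave::Global];
    match rollout(&waves, 0.01) {
        Ok(()) => println!("config fully deployed"),
        Err(e) => eprintln!("{e}"),
    }
}
```

The point is only that each wave gets a chance to fail small before the next one starts.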
As for the specific bug: meh. They should have had 100% branch coverage on something as critical (and likely small) as the parsing for this config. Arguably a statically typed language would have helped (but the `.unwrap()` error in the previous outage is a bit of a counterargument to that). But it just wouldn't have mattered that much if they caught it before global rollout.
That being said, I think it’s worth a discussion. How much of the last 3 outages was because of JGC (the former CTO) retiring and Dane taking over?
Did JGC have a steady hand that’s missing? Or was it just time for outages that would have happened anyway?
Dane has maintained a culture of transparency, which is fantastic, but did something get injected into the culture that is leading to these issues? Will things become more or less stable now that JGC has left?
Curious for anyone with some insight or opinions.
(Also, if it wasn’t clear - huge Cloudflare fan and sending lots of good vibes to the team)
So they are aware of some basic mitigation tactics guarding against errors:
> This system does not perform gradual rollouts,
They just choose to YOLO it.
> Typical actions are “block”, “log”, or “skip”. Another type of action is “execute”,
> However, we have never before applied a killswitch to a rule with an action of “execute”.
Do they do no testing? This wouldn't even require fuzzing with “infinite” variations; it's a limited list of actions.
> existed undetected for many years. This type of code error is prevented by languages with strong type systems.
So this solution is also well known, just ignored for years, because "if it’s not broken, don’t fix it", right?
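For illustration, this is roughly what the "strong type system" argument buys you in Rust (hypothetical types, not the actual rules module): if result processing has to match exhaustively on the possible outcomes, forgetting the killswitched-execute case is a compile error instead of a years-old latent bug.

```rust
// Hypothetical sketch of exhaustive matching over rule outcomes; not the real module.
#[allow(dead_code)]
enum Outcome {
    Blocked,
    Logged,
    Skipped,
    SubRuleset(Vec<Outcome>), // produced by an "execute" action
    KillswitchedExecute,      // execute rule disabled: no sub-ruleset was evaluated
}

fn summarize(outcome: &Outcome) -> &'static str {
    // Remove any arm here (say, KillswitchedExecute) and this fails to compile,
    // instead of surfacing years later as a runtime exception.
    match outcome {
        Outcome::Blocked => "blocked",
        Outcome::Logged => "logged",
        Outcome::Skipped => "skipped",
        Outcome::SubRuleset(_) => "evaluated sub-ruleset",
        Outcome::KillswitchedExecute => "execute rule killswitched, nothing evaluated",
    }
}

fn main() {
    println!("{}", summarize(&Outcome::KillswitchedExecute));
}
```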
Why would increasing the buffer size help with that security vulnerability? Is it just a performance optimization?
It still surprises me that there are basically no free alternatives comparable to Cloudflare. Putting everything on CF creates a pretty serious single point of failure.
It's strange that in most industries you have at least two major players, like Coke vs. Pepsi or Nike vs. Adidas. But in the CDN/edge space, there doesn't seem to be a real free competitor that matches Cloudflare's feature set.
It feels very unhealthy for the ecosystem. Does anyone know why this is the case?
I don't get it: their main product is DDoS protection, yet Cloudflare itself keeps going down for some reason.
This company makes zero sense to me.
https://blog.cloudflare.com/deep-dive-into-cloudflares-sept-...
- prioritize security: get patches ASAP
- prioritize availability: get patches after a cooldown period
Because ultimately, it's a tradeoff that cannot be handled by Cloudflare. It depends on your business, your threat model.
Benefit: Earliest uptake of new features and security patches.
Drawback: Higher risk of outages.
I think this should be possible since they already differentiate between Free, Pro, and Enterprise accounts. I do not know how the routing for that works, but I bet they could do this. Think crowd-sourced beta testers. It would also be a perk for anyone whose PCI audit or FedRAMP requirements prioritize security over uptime.
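A sketch of what that opt-in might look like, purely hypothetical (Cloudflare exposes no such setting; names and wave numbers are made up):

```rust
// Purely hypothetical: this just makes the opt-in idea above concrete.
enum RolloutPreference {
    Security,     // earliest wave: new rules and patches ASAP ("crowd-sourced beta")
    Availability, // later wave: only after earlier waves have soaked cleanly
}

/// Which deployment wave a zone joins, given the preference its owner opted into.
fn rollout_wave(pref: &RolloutPreference) -> u8 {
    match pref {
        RolloutPreference::Security => 0,
        RolloutPreference::Availability => 2,
    }
}

fn main() {
    // A PCI/FedRAMP-driven zone that values patches over uptime opts in early;
    // a zone that values uptime waits for the later wave.
    assert_eq!(rollout_wave(&RolloutPreference::Security), 0);
    assert_eq!(rollout_wave(&RolloutPreference::Availability), 2);
}
```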
At some point they'll have to admit this React thing ain't working and just use classic server-rendered pages, since their dashboards are simple toggle controls.
If someone messes up royally, is there someone who says "if you break the build/whatever super critical, then your ass is the grass and I'm the lawn mower"?
I truly believe they're really going to make resilience their #1 priority now, and acknowledging the release process errors that they didn't acknowledge for a while (according to other HN comments) is the first step towards this.
HugOps. Although bad for reputation, I think these incidents will help them shape (and prioritize!) resilience efforts more than ever.
At the same time, I can't think of a company more transparent than CloudFlare when it comes to these kinds of things. I also understand the urgency behind this change: CloudFlare acted (too) fast to mitigate the React vulnerability, and this is the result.
Say what you want, but I'd prefer to trust CloudFlare, who admits and acts upon their fuckups, rather than someone who tries to cover them up or downplay them like some other major cloud providers do.
@eastdakota: ignore the negative comments here; transparency is a very good strategy, and this article shows a good plan to avoid further problems.
> we started rolling out an increase to our buffer size to 1MB, the default limit allowed by Next.js applications.
Why is the Next.js limit 1 MB? It's not enough for uploading user-generated content (photographs, scanned invoices), while a 1 MB request body is ridiculous for even multiple JSON API calls. These frameworks need to at least provide some pushback against unoptimized development, even if it's just a lower default request body limit. Otherwise all web applications will become as slow as the MS Office suite or Reddit.
ALSO, it is very, very weird that they had not caught this seemingly obvious bug in the proxy buffer size handling. It suggests that change nr 2, done in "reactive" mode after change nr 1 broke shit, HAD NOT BEEN TESTED AT ALL! Which is the core reason they should never have deployed it; they should instead have reverted to a known good state, then tested BOTH changes combined.
"Why?"
"I've just been transferred to the Cloudflare outage explanation department."
Cloudflare deployed code that was literally never tested, not even once, neither manually nor by unit test; otherwise the straightforward error would have been detected immediately. And their implied solution seems to be not testing their code when it is written, nor even adding 100% code coverage after the fact, but rather relying on a programming language to bail them out and cover up their failure to test.
Doesn't Cloudflare rigorously test their changes before deployment to make sure things like this don't happen again? This better not be covering for the fact that they are using AI to fix issues like this one.
There had better not be any vibe coders or AI agents touching such critical pieces of infrastructure at all, and I expected Cloudflare to learn from the previous outage very quickly.
But this is becoming quite a pattern, and we might need to start putting their unreliability next to GitHub's (which goes down every week).
Come on.
This post-mortem raises more questions than it answers, such as why exactly China would have been immune.
https://blog.cloudflare.com/5-december-2025-outage/#what-abo...
Interesting.