Ultimately end-users don’t have a relationship with any of those companies. They have relationships with businesses that chose to rely on them. Cloudflare et al. publish SLAs and compensation schedules for when those SLAs are missed. Businesses chose to accept those SLAs and take on that risk.
If Cloudflare et al. signed a contract promising a certain SLA (with penalties) and then refused to pay out those penalties, there would be reason to ask questions. But nothing suggests they’re not holding up their side of the deal - you will absolutely get compensated (in the form of a refund on your bill) in case of an outage.
The issue is that businesses accept this deal and then scream when it goes wrong, yet are unwilling to pay for a solution that does not fail in this way. Those solutions exist - you absolutely can build systems that are reliable and/or fail in a predictable and testable manner; it’s simply more expensive and requires more skill than slapping a few SaaSes and CNCF projects together. But it is possible - look at the uptime of card networks, stock exchanges, or airplane avionics. The truth is that businesses don’t want to pay for it (and neither do their end-customers - they will bitch about outages, but will immediately run the other way if you ask them to pony up for a more reliable system; the ones that don’t run away already use such a system and were unaffected by the recent outages).
We mustn't assume that Cloudflare isn't undertaking this process just because we're not privy to it.
It's extremely easy, and correspondingly valueless, to ask all kinds of "hard questions" about a system 24 hours after it had a huge incident. The hard part is doing this appropriately for every part of the system before something happens, while maintaining the organization's other, equally legitimate goals (such as cost-efficiency, product experience, and performance). There's little evidence to suggest Cloudflare isn't doing that, and their track record is definitely good for their scale.
Cloudflare is probably one of the best "voices" in the industry when it comes to post-mortems and root cause analysis.
I don't love piling on, but it still shocks me that people write without first reading.
I think the ultimate judgement must come from whether we will stay with Cloudflare now that we have seen how bad it can get. One could also say that this level of outage hasn't happened in many years, and that they are now freshly frightened of it happening again, so expect things to get tightened up (probably using different questions than this blog post proposes).
As for what this blog post could have been: maybe a page on how the author actively used these ideas at, e.g., Tradera or Loop54.
So during ingress there’s not an async call to the bot management service intercepting the request before it’s sent to the origin - it’s literally a Lua script (or Rust module in fl2) that runs inline on ingress as part of handling the request. Thus there are no timeouts or other concerns about the bot management service failing to assign a bot score.
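For what it’s worth, here’s a minimal Rust sketch of that distinction. It is entirely hypothetical (made-up types and names, not Cloudflare’s actual fl2 code): the scoring step runs inline in the request path, so a malformed feature config fails the request itself rather than timing out against a separate service.

    // Hypothetical sketch, not Cloudflare's code: an inline bot-scoring step
    // that runs in the same call stack as request handling. A broken feature
    // config fails the request directly; there is no remote call, so no
    // timeout to fall back on.

    struct Request {
        user_agent: String,
    }

    struct BotConfig {
        // Stand-in for a bot-management feature file:
        // just a list of (feature name, weight) pairs.
        features: Vec<(String, f32)>,
    }

    impl BotConfig {
        fn load(raw: &str) -> Result<Self, String> {
            let features = raw
                .lines()
                .map(|line| -> Result<(String, f32), String> {
                    let (name, weight) = line
                        .split_once(',')
                        .ok_or_else(|| format!("malformed feature line: {line}"))?;
                    let weight: f32 = weight
                        .trim()
                        .parse()
                        .map_err(|_| format!("bad weight in line: {line}"))?;
                    Ok((name.trim().to_string(), weight))
                })
                .collect::<Result<Vec<_>, String>>()?;
            Ok(Self { features })
        }
    }

    // Scoring happens inline during ingress; any error here surfaces as a
    // failed request rather than as a degraded score from a timed-out side call.
    fn handle_ingress(req: &Request, cfg: &BotConfig) -> Result<u8, String> {
        if cfg.features.is_empty() {
            return Err("bot config has no features".into());
        }
        // Toy heuristic standing in for the real ML model.
        let raw: f32 = cfg
            .features
            .iter()
            .map(|(name, w)| if req.user_agent.contains(name.as_str()) { *w } else { 0.0 })
            .sum();
        // ...the request would then be forwarded to origin with this score attached.
        Ok(raw.clamp(0.0, 99.0) as u8)
    }

    fn main() {
        let req = Request { user_agent: "curl/8.5.0".into() };

        let good = BotConfig::load("curl, 60.0\nheadless, 90.0").unwrap();
        println!("bot score: {:?}", handle_ingress(&req, &good));

        // A malformed config makes the inline path error out for the request itself.
        match BotConfig::load("curl; sixty") {
            Ok(_) => println!("unexpectedly parsed"),
            Err(e) => println!("request would fail inline: {e}"),
        }
    }

The trade-off, as noted above, is that an out-of-process scorer would need a deadline and a fallback score, while the inline path simply fails the request when its inputs are bad.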
There are better questions, but to me the ones posed don’t seem particularly interesting.
With that said, I would also like to know how it took them ~2 hours to see the error. That's a long, long time.