Hacker News

How when AWS was down, we were not

194 points by mooreds

by scottlamb

2 subcomments

I'm surprised the section about retries doesn't mention correlations. They say:
> P_{total}(Success) = 1 - P_{3rdParty}(Failure)^{RetryCount}
By treating P_{3rdParty}(Failure) as fixed, they're assuming a model in which each try is completely independent: all the failures are due to background noise. But that's totally wrong, as shown by the existence of big outages like the one they're describing, and it's not consistent with the way they describe outages in terms of the time they are down (rather than purely the fraction of requests that fail).
In reality, additional retries don't improve reliability as much as that formula says. Given that request 1 failed, request 2 (sent immediately afterward with the same body) probably will too. And there's another important effect: overload. During a major outage, retries often decrease reliability in aggregate—maybe retrying one request makes it more likely to go through, but retrying all the requests causes significant overload, often decreasing the total number of successes.
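To put rough numbers on the correlation point, here is a minimal sketch comparing the quoted formula with a simple two-state model. The model and every name in it (`p_outage`, `p_noise`, the function names) are assumptions made for illustration, not anything from the article: with probability `p_outage` the third party is in an outage and every immediate retry fails, otherwise each try fails independently with probability `p_noise`. It does not model overload at all.

```python
def marginal_failure(p_outage: float, p_noise: float) -> float:
    """Per-try failure probability as seen by someone who ignores correlation."""
    return p_outage + (1.0 - p_outage) * p_noise


def success_independent(p_fail: float, tries: int) -> float:
    """The quoted formula: every try fails independently with probability p_fail."""
    return 1.0 - p_fail ** tries


def success_correlated(p_outage: float, p_noise: float, tries: int) -> float:
    """Hypothetical two-state model: during an outage all tries fail;
    outside an outage, tries fail independently with probability p_noise."""
    p_all_fail = p_outage + (1.0 - p_outage) * p_noise ** tries
    return 1.0 - p_all_fail


if __name__ == "__main__":
    p_outage, p_noise = 0.001, 0.001  # illustrative values only
    p_fail = marginal_failure(p_outage, p_noise)
    for tries in (1, 2, 3):
        print(
            tries,
            success_independent(p_fail, tries),
            success_correlated(p_outage, p_noise, tries),
        )
```

Under these assumptions the two models agree for a single try, but the correlated model can never do better than `1 - p_outage` no matter how many immediate retries are added, while the independent formula keeps improving geometrically.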
I think this correlation is a much bigger factor than "the reliability of that retry handler" that they go into instead. Not sure what they mean there anyway—if the retry handler is just a loop within the calling code, calling out its reliability separately from the rest of the calling code seems strange to me. Maybe they're talking about an external queue (SQS and the like) for deferred retries, but that brings in a whole different assumption that they're talking about something that can be processed asynchronously. I don't see that mentioned, and it seems inconsistent with the description of these requests as on the critical path for their customers. Or maybe they're talking about hitting a "circuit breaker" that prevents excessive retries—which is a good practice due to the correlation I mentioned above, but if so it seems strange to describe it so obliquely, and again strange to describe its reliability as an inherent/independent thing, rather than a property of the service being called.
Additionally, a big pet peeve of mine is talking about reliability without involving latency. In practice, there's only so long your client is willing to wait for the request to succeed. If, say, that's 1 second, and you're waiting 500 ms for an outbound request before timing out and retrying, you can't even quite make it to 2 full (sequential) tries. You can hedge (wait a bit, then send a second request in parallel) for many types of requests, but that also worsens the math on overload and correlated failures.
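The latency arithmetic can be sketched as below. This is only an illustration: `max_full_tries` and the `overhead_ms` parameter (local processing, backoff, connection setup per try) are hypothetical names, and the sketch assumes every failed try runs to its full timeout.

```python
def max_full_tries(deadline_ms: int, timeout_ms: int, overhead_ms: int = 0) -> int:
    """Number of sequential tries that can each run to their full timeout
    before the client's deadline expires.

    overhead_ms is an assumed fixed cost per try (not from the original comment).
    """
    if timeout_ms <= 0 or overhead_ms < 0 or deadline_ms < 0:
        raise ValueError("timeout must be positive; deadline and overhead non-negative")
    return deadline_ms // (timeout_ms + overhead_ms)


if __name__ == "__main__":
    # 1 s client deadline, 500 ms per-try timeout, 10 ms assumed overhead per try
    print(max_full_tries(1000, 500, 10))
    # Same deadline with zero overhead, which is the best case
    print(max_full_tries(1000, 500, 0))
```

With any nonzero per-try overhead, a 1 second deadline and a 500 ms timeout leave room for only one full try, which is the "can't even quite make it to 2" case.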
The rest of the article might be much clearer, but I have a fever and didn't make it through.

by rdoherty

2 subcomments

This is probably one of the best summarizations of the past 10 years of my career in SRE. Once your systems get complex enough, something is always broken and you have to prepare for that. Detection & response become just as critical as pre-deploy testing.
I do worry about all the automation being another failure point, along with the IaC stuff. That is all software too! How do you update that safely? It's turtles all the way down!

by sharklasers123

3 subcomments

Is there not an inherent risk using an AWS service (Route 53) to do the health check? Wouldn’t it make more sense to use a different cloud provider for redundancy?

by thisnullptr

5 subcomments

It’s fascinating to me people think their services are so important they can’t survive any downtime. Can we all admit that, while annoying, nothing really bad happened even when us-east-1 was down for almost half a working day?

by indigodaddy

2 subcomments

Back in the day (10-12 years ago) at a telecom/cable company we accomplished this with F5 BIG-IP GSLB DNS (and later migrated to A10's GSLB equivalent devices) as the authoritative DNS server for services/zones that required or were suitable for HA. (I can't totally remember, but I'm guessing we must have had a pretty low TTL for this.)
Had no idea that Route 53 had this sort of functionality.

by wparad

1 subcomments

Hey, I wrote that article!
I'll try to add comments and answer questions where I can.
- Warren

by pinkmuffinere

3 subcomments

> During this time, us-east-1 was offline, and while we only run a limited amount of infrastructure in the region, we have to run it there because we have customers who want it there
> [Our service can only go down] five minutes and 15 seconds per year.
I don't have much experience in this area, so please correct me if I'm mistaken:
Don't these two quotes together imply that they have failed to deliver on their SLA for the subset of their customers that want their service in us-east-1? I understand the customers won't be mad at them in this case, since us-east-1 itself is down, but I feel like their title is incorrect. Some subset of their service is running on top of AWS. When AWS goes down, that subset of their service is down. When AWS was down, it seems like they were also down for some customers.
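For scale, the quoted budget of five minutes and 15 seconds per year lines up with a 99.999% availability target. The sketch below shows that conversion, assuming a 365-day year; the percentage is inferred from the quoted duration, and the function names are hypothetical.

```python
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # assumes a non-leap year


def downtime_budget_seconds(availability: float) -> float:
    """Allowed downtime per year, in seconds, for an availability in [0, 1]."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return (1.0 - availability) * SECONDS_PER_YEAR


def format_budget(seconds: float) -> str:
    """Render a duration in seconds as whole minutes plus rounded seconds."""
    minutes, rem = divmod(seconds, 60)
    return f"{int(minutes)} min {rem:.0f} s"


if __name__ == "__main__":
    budget = downtime_budget_seconds(0.99999)
    print(format_budget(budget))
```

A multi-hour regional outage would exceed that yearly budget many times over for any customer pinned to the affected region, which is the concern raised above.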

by hartator

1 subcomments

Interesting how engineers like to nerd out about SLAs, but never claim or issue credits when something does occur.

by sam-cop-vimes

1 subcomments

What a well-written article! Nothing complex is built overnight, so it is interesting to see how their defenses have evolved to their current state. That requires an engineering team that actually cares about all this, and a consistent approach across what seems like 6 years? Impressive.

by 0xbadcafebee

1 subcomments

It's a very rare day that a professional explanation of real operations best practices lands on HN. Good job, Authress!

by JSR_FDED

2 subcomments

> We test before deployment. There is no better time to test.
Love the deadpan delivery.

by markdown

1 subcomments

BTW clicking on your website logo takes one to https://authress.io/knowledge-base/ instead of https://authress.io

by iso1631

2 subcomments

I'm interested in how they measure that downtime. If you're down for 200 milliseconds, does that accumulate? How do you even measure that you're down for 200 ms?
(For what it's worth, for some of my services, 200 ms is certainly an impact: not as bad as 2 seconds of outage, but still noticeable and reportable.)

by tptacek

1 subcomments

This is a rare case where the original bait-y title is probably better than the de-bait-ified title, because the actual article is much less of a brag and much more of an actual case study.

by DeathArrow

1 subcomments

TLDR: they use dynamic DNS routing and have failover regions.