> P_{total}(Success) = 1 - P_{3rdParty}(Failure)^{RetryCount}
By treating P_{3rdParty}(Failure) as fixed, they're assuming a model in which each try is completely independent: all the failures are due to background noise. But that's totally wrong, as shown by the existence of big outages like the one they're describing, and it's not consistent with the way they describe outages in terms of how long they are down (rather than purely the fraction of requests that fail).
In reality, additional retries don't improve reliability as much as that formula says. Given that request 1 failed, request 2 (sent immediately afterward with the same body) probably will too. And there's another important effect: overload. During a major outage, retries often decrease reliability in aggregate—maybe retrying one request makes it more likely to go through, but retrying all the requests causes significant overload, often decreasing the total number of successes.
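To put rough numbers on the correlation point, here's a minimal sketch comparing the quoted formula against a toy two-state model where the third party is either "in an outage" or "healthy". The function names and all the probabilities are made up for illustration, and it ignores the overload effect entirely, so if anything it flatters retries.

```python
def independent_success(p_fail: float, tries: int) -> float:
    """The quoted formula: every try fails independently with p_fail."""
    return 1 - p_fail ** tries


def correlated_success(p_outage: float, p_fail_outage: float,
                       p_fail_healthy: float, tries: int) -> float:
    """Toy two-state model (an assumption, not from the article).

    All tries for one request see the same state, so failures within
    a request are correlated through that shared state.
    """
    healthy = (1 - p_outage) * (1 - p_fail_healthy ** tries)
    outage = p_outage * (1 - p_fail_outage ** tries)
    return healthy + outage


# Hypothetical numbers: 1% of the time the dependency is hard down,
# otherwise it never fails. Single-try failure rate is 1% either way.
p_outage, p_fail_outage, p_fail_healthy = 0.01, 1.0, 0.0
p_single = p_outage * p_fail_outage + (1 - p_outage) * p_fail_healthy

formula = independent_success(p_single, tries=3)
two_state = correlated_success(p_outage, p_fail_outage, p_fail_healthy, tries=3)
print(f"formula: {formula:.6f}  two-state: {two_state:.6f}")
```

Same single-try failure rate in both cases, but with these numbers the formula promises roughly six nines while the two-state model stays at 99%: retrying into a hard outage buys nothing.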
I think this correlation is a much bigger factor than "the reliability of that retry handler" that they go into instead. Not sure what they mean there anyway—if the retry handler is just a loop within the calling code, calling out its reliability separately from the rest of the calling code seems strange to me. Maybe they're talking about an external queue (SQS and the like) for deferred retries, but that brings in a whole different assumption that they're talking about something that can be processed asynchronously. I don't see that mentioned, and it seems inconsistent with the description of these requests as on the critical path for their customers. Or maybe they're talking about hitting a "circuit breaker" that prevents excessive retries—which is a good practice due to the correlation I mentioned above, but if so it seems strange to describe it so obliquely, and again strange to describe its reliability as an inherent/independent thing, rather than a property of the service being called.
Additionally, a big pet peeve of mine is talking about reliability without involving latency. In practice, there's only so long your client is willing to wait for the request to succeed. If say that's 1 second, and you're waiting 500 ms for an outbound request before timing out and retrying, you can't even quite make it to 2 full (sequential) tries. You can hedge (wait a bit then send a second request in parallel) for many types of requests, but that also worsens the math on overload and correlated failures.
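The budget arithmetic is simple enough to sketch. This assumes strictly sequential tries that each run to the full timeout, plus an optional fixed gap between tries (backoff, connection setup, whatever); `max_full_tries` and the 10 ms gap are my own illustrative choices, not anything from the article.

```python
def max_full_tries(budget_ms: int, timeout_ms: int, gap_ms: int = 0) -> int:
    """Count sequential tries that can run to their full timeout
    before the client's overall latency budget is exhausted."""
    if timeout_ms <= 0:
        raise ValueError("timeout_ms must be positive")
    tries = 0
    elapsed = 0
    # Only start a try if it could still time out within the budget.
    while elapsed + timeout_ms <= budget_ms:
        tries += 1
        elapsed += timeout_ms + gap_ms
    return tries


# 1 second budget, 500 ms per-try timeout.
no_overhead = max_full_tries(1000, 500)           # idealized: zero overhead
with_gap = max_full_tries(1000, 500, gap_ms=10)   # hypothetical 10 ms between tries
print(no_overhead, with_gap)
```

With zero overhead you get exactly 2 tries; add any gap at all between them and the second try no longer fits, which is what I mean by not quite making it to 2.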
The rest of the article might be much clearer, but I have a fever and didn't make it through.
I do worry about all the automation being another failure point, along with the IaC stuff. That is all software too! How do you update that safely? It's turtles all the way down!
Had no idea that Route 53 had this sort of functionality
I'll try to add comments and answer questions where I can.
- Warren
> [Our service can only go down] five minutes and 15 seconds per year.
I don't have much experience in this area, so please correct me if I'm mistaken:
Don't these two quotes together imply that they have failed to deliver on their SLA for the subset of their customers that want their service in us-east-1? I understand the customers won't be mad at them in this case, since us-east-1 itself is down, but I feel like their title is incorrect. Some subset of their service is running on top of AWS. When AWS goes down, that subset of their service is down. When AWS was down, it seems like they were also down for some customers.
Love the deadpan delivery.
(For what it's worth, for some of my services, 200ms is certainly an impact, not as bad as 2 seconds of outage but still noticeable and reportable)