On some thread, somewhere, there's going to be a round trip between my servers and yours, and once I'm at a scale where this sort of thing matters, I'm going to want this on-prem.
What's the difference between this and checking a local cache before firing the request, and marking the service down in that same cache so my other systems can see it?
I'm also concerned about a false positive, or a single system throwing an error: if it's a false positive, the protected asset fails on all of my systems, which doesn't seem great. I'll take some requests working over none when money is in play.
You also state that "The SDK keeps a local cache of breaker state" -- if I've got 50 servers, where is that local cache living? If it's per process, that's not great, and if it's something shared like Redis or memcached, I'm better off using my own network for "sub microsecond response" than spending the time to go over the wire to talk to your service.
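To make that concrete, here's roughly the setup I'm describing: breaker state lives in a Redis on my own network, every server checks it before firing, and whoever sees a failure marks the service down so the rest can see it. The key name, TTL, and endpoint are just placeholders, not anything from your SDK:

```python
# Sketch of "check a shared cache before firing the request": breaker state
# lives in a Redis on my own network, not behind someone else's API.
# Key name, TTL, and the endpoint are placeholders.
import redis
import requests

r = redis.Redis(host="breaker-cache.internal", port=6379)

BREAKER_KEY = "breaker:stripe"   # one key per protected dependency
OPEN_TTL_SECONDS = 30            # tripped breaker auto-expires after this

def call_stripe(payload):
    # One LAN round trip tells any of the 50 servers whether the breaker is open.
    if r.get(BREAKER_KEY) == b"open":
        raise RuntimeError("breaker open, skipping call")
    try:
        resp = requests.post("https://api.stripe.com/v1/charges", data=payload, timeout=2)
        resp.raise_for_status()
        return resp
    except requests.RequestException:
        # Mark the service down so every other process sees it on its next check.
        r.set(BREAKER_KEY, "open", ex=OPEN_TTL_SECONDS)
        raise
```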
I've fought huge cascading issues in production at very large social media companies, and it takes a bit more than breakers to solve these problems. Backpressure is a critical component, and often turning things off completely isn't the best approach.
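By backpressure I mean bounding how much work is in flight and shedding the excess, rather than flipping the dependency off entirely. A minimal sketch of what that looks like (the limit is invented):

```python
# Minimal backpressure sketch: cap concurrent calls to a dependency and fail
# fast on the overflow, instead of queueing unbounded work behind it.
import asyncio

MAX_IN_FLIGHT = 100              # invented limit; tune per dependency
sem = asyncio.Semaphore(MAX_IN_FLIGHT)

async def call_downstream(do_request):
    # Shedding here keeps a slow dependency from dragging the caller down with it.
    if sem.locked():
        raise RuntimeError("shedding load: dependency is saturated")
    async with sem:
        return await do_request()
```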
But this is a feature, not a bug. You seem to be assuming that people use circuit breakers only on external requests; in that situation your approach seems reasonable.
If you have CBs on every service call, your model doesn't seem like a good idea. Where I work, every network call is behind a CB (external services, downstream services, database, Redis, S3, ...), and it's pretty common to see failures isolated to a single k8s node. In that situation we want independent CBs that can open independently.
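Concretely, "independent CBs" means something like this running in every process, so a bad node only opens its own breakers. The thresholds and timings are made up:

```python
# Per-process breaker sketch: every pod keeps its own state per dependency,
# so a failure isolated to one k8s node only opens that node's breakers.
# Thresholds and timings are made up.
import time

class CircuitBreaker:
    def __init__(self, failure_threshold=5, reset_timeout=30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.failures = 0
        self.opened_at = None

    def allow(self):
        if self.opened_at is None:
            return True
        # After the reset timeout, let a call through to probe for recovery.
        return time.monotonic() - self.opened_at >= self.reset_timeout

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = time.monotonic()

# One breaker per dependency, per process -- nothing shared across nodes.
breakers = {name: CircuitBreaker() for name in ("postgres", "redis", "s3", "stripe")}
```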
Your take on observability/operations seems interesting, but it's pretty close to feature flags. And that is exactly how we handle these scenarios: we have a couple of feature flags we can enable to switch traffic around during outages. Switching to the fallback is easy most of the time, but switching back to normal operation is harder.
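For a rough idea of what those flags look like in code (the flag store here is a plain dict standing in for whatever flag system you actually use):

```python
# Sketch of the feature-flag approach: a flag flipped by an operator during an
# outage routes traffic to a degraded fallback. The flag store is a plain dict
# standing in for a real flag system.
FLAGS = {"recommendations_use_fallback": False}

def get_recommendations(user_id, primary_service, cached_fallback):
    if FLAGS["recommendations_use_fallback"]:
        # Degraded path an operator turns on during the outage.
        return cached_fallback.popular_items()
    return primary_service.recommendations_for(user_id)

# Switching the flag back off is the harder half: you usually want to ramp
# traffic back gradually rather than slam everything onto the recovered path.
```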
Doesn't this suffer from the opposite problem, though? There's a very brief hiccup for Stripe, and instance 7 trips the circuit breaker. Then all the other services stop trying to contact Stripe even though Stripe has recovered in the meantime. Or am I missing something about how your platform works?
Making your circuit breaker state global seems like it would just exacerbate the problem. Failures are often partial in the real world.
What happens when your service goes down?