In large distributed systems, the best you can aim for is "statistically acceptable". You can always tailor a workload that will break a guarantee in the real world.
So you engineer with techniques that increase the likelihood that the workloads you have characterized as realistic can be handled with headroom, and you worry about graceful degradation under oversubscription (i.e. maintaining goodput). In my experience, that usually comes down to good load balancing, auto-scaling, and load shedding.
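To make "graceful degradation" concrete, here is a minimal load-shedding sketch. The class name and the `soft_limit`/`hard_limit` knobs are illustrative assumptions, not from any particular system: below the soft limit everything is admitted, above the hard limit everything is shed, and in between requests are rejected with probability proportional to the overload, so goodput degrades gradually instead of collapsing.

```python
import random


class LoadShedder:
    """Probabilistic load shedding between a soft and a hard in-flight limit.

    Hypothetical sketch; the limits are illustrative knobs.
    """

    def __init__(self, soft_limit: int, hard_limit: int):
        self.soft_limit = soft_limit
        self.hard_limit = hard_limit
        self.in_flight = 0

    def try_admit(self) -> bool:
        # Below the soft limit: admit everything.
        if self.in_flight < self.soft_limit:
            self.in_flight += 1
            return True
        # At or above the hard limit: shed everything.
        if self.in_flight >= self.hard_limit:
            return False
        # In between: shed with probability proportional to overload,
        # so rejection ramps up smoothly as the service saturates.
        overload = (self.in_flight - self.soft_limit) / (
            self.hard_limit - self.soft_limit
        )
        if random.random() < overload:
            return False
        self.in_flight += 1
        return True

    def done(self) -> None:
        # Call when a request completes, successfully or not.
        self.in_flight -= 1
```

A caller would wrap each request in `try_admit()` / `done()` and return a fast "overloaded" error to the client when admission fails, which is exactly the back-pressure signal discussed below.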
Virtually all of the truly bad incidents I've seen in large-scale distributed systems are caused by an inability to recover back to steady-state after some kind of unexpected perturbation.
If I had to characterize problem number one, it's subscriber-service request patterns that don't handle back-pressure appropriately: subscribers that don't know how to back off properly, and services that don't push back when oversubscribed. The classic example is a subscriber that retries requests on a static schedule and gives up on requests that have been in flight "too long", coupled with a service that continues to accept requests when oversubscribed.
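The subscriber-side fix for the static-schedule anti-pattern above is capped exponential backoff with jitter. A minimal sketch, where the function name and defaults are illustrative assumptions rather than any particular client library's API:

```python
import random
import time


def call_with_backoff(request, max_attempts=5, base=0.1, cap=5.0):
    """Retry a zero-arg callable with capped exponential backoff and full jitter.

    Illustrative sketch: `request` raises on failure; the parameter
    names and defaults are assumptions, not a real library's API.
    """
    for attempt in range(max_attempts):
        try:
            return request()
        except Exception:
            # Out of attempts: surface the failure instead of retrying forever.
            if attempt == max_attempts - 1:
                raise
            # Full jitter: sleep a random amount up to the capped exponential.
            # This decorrelates retries across subscribers and avoids the
            # synchronized retry storms that keep an oversubscribed service down.
            time.sleep(random.uniform(0, min(cap, base * 2 ** attempt)))
```

The jitter is the important part: a fleet of subscribers all retrying on the same fixed schedule re-arrives as a single spike, which is precisely the perturbation the system can't recover from.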
It's like the quote attributed to Don Knuth: "Beware of bugs in the above code; I have only proved it correct, not tried it."
But in practice, I've spent just as much time on issues introduced by performance and scalability limitations. And the post's thesis is correct: we don't have great tools for reasoning about this. It's been pretty much all I've been thinking about recently.
https://arxiv.org/abs/2205.15211
Already we have Resource Aware ML, which
> automatically and statically computes resource-use bounds for OCaml programs
A hammer is great for certain things, but I don't expect it to make good coffee; I use other tools for that. That doesn't make hammers deficient.
Tools like Lean and Rocq can do arbitrary math; the limit is your time and budget, not the tool.
These performance questions can be mathematically defined, so it is possible.