- I’m sure this work is very impressive, but these QPS numbers don’t seem particularly high to me, at least compared to existing horizontally scalable service patterns. Why is it hard for the kube control plane to hit these numbers?
For instance, Postgres can hit this sort of QPS easily, afaik. It's not distributed, but I'm sure Vitess could do something similar. The query patterns don't seem particularly complex either.
Not trying to be reductive - I’m sure there’s some complexity here I’m missing!
by __turbobrew__
7 subcomments
- It makes me sad that getting these scalability numbers requires some secret sauce on top of Spanner that nobody else in the k8s community can benefit from. etcd is the main bottleneck in upstream k8s, and there seems to be no real momentum to build an upstream replacement for etcd/boltdb.
I did poke around a while ago to see what interfaces etcd has for calling into boltdb, but the interface doesn't seem super clean right now, so the first step in getting off boltdb would be creating a clean interface that could be implemented by another db.
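To make that last point concrete, here is a minimal sketch of what a clean, pluggable storage boundary between etcd's MVCC layer and the backing store might look like. Everything here is hypothetical and illustrative — it is not etcd's actual internal API, which is currently shaped around bbolt's bucket/transaction model:

```go
// Package storage sketches a hypothetical backend interface that etcd's
// MVCC layer could target instead of calling into bbolt directly.
package storage

// Bucket names a logical keyspace, mirroring bbolt's bucket concept.
type Bucket string

// KV is the minimal transactional key-value surface the MVCC layer needs.
// A bbolt-backed implementation would wrap *bbolt.DB; an alternative
// database would implement the same methods.
type KV interface {
	// Begin opens a transaction; writable selects read-only vs read-write.
	Begin(writable bool) (Tx, error)
	Close() error
}

// Tx is a single transaction against the store.
type Tx interface {
	Put(b Bucket, key, value []byte) error
	Get(b Bucket, key []byte) ([]byte, error)
	// Range visits keys in [start, end) in order, calling fn on each pair.
	Range(b Bucket, start, end []byte, fn func(k, v []byte) error) error
	Delete(b Bucket, key []byte) error
	Commit() error
	Rollback() error
}
```

The point is less the exact method set than the boundary: once the MVCC layer only ever talks to something like KV/Tx, swapping boltdb for another embedded or distributed store becomes an implementation exercise rather than surgery.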
- Papers like this are fascinating engineering, but dangerous marketing.
They convince every Series A startup that they need a multi-region federated control plane for their 50 microservices. I spend half my time convincing my team not to emulate Google, because we don't have Google's scale problems—we have velocity problems.
Complexity is an asset for Google (it's a moat), but a liability for the rest of us. I just want a cluster that doesn't require a dedicated ops team to upgrade.
by blurrybird
1 subcomment
- AWS and Anthropic did this back in July: https://aws.amazon.com/blogs/containers/amazon-eks-enables-u...
by yanhangyhy
2 subcomments
- There is a doc about how to do this with 1M nodes: https://bchess.github.io/k8s-1m/#_why
So I guess the title is not true?
- They mention GCS FUSE. We've had nothing but performance and stability problems with it.
We treat it as a best-effort alternative when native GCS access isn't possible; a sketch of what native access looks like follows after this comment.
by jakupovic
3 subcomments
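For readers unfamiliar with the distinction the parent is drawing: GCS FUSE mounts a bucket as a POSIX-ish filesystem, while native access talks to the GCS API directly. A minimal sketch of the native path using the official cloud.google.com/go/storage client (bucket and object names are placeholders):

```go
package main

import (
	"context"
	"fmt"
	"io"
	"log"

	"cloud.google.com/go/storage" // official GCS client library
)

func main() {
	ctx := context.Background()

	// The native client speaks to GCS directly, with no FUSE layer
	// emulating POSIX semantics in between.
	client, err := storage.NewClient(ctx)
	if err != nil {
		log.Fatalf("storage.NewClient: %v", err)
	}
	defer client.Close()

	// "my-bucket" and "path/to/object" are placeholder names.
	r, err := client.Bucket("my-bucket").Object("path/to/object").NewReader(ctx)
	if err != nil {
		log.Fatalf("NewReader: %v", err)
	}
	defer r.Close()

	n, err := io.Copy(io.Discard, r)
	if err != nil {
		log.Fatalf("read: %v", err)
	}
	fmt.Printf("read %d bytes\n", n)
}
```

Whether this is feasible depends on the workload: anything that insists on real file semantics (mmap, partial rewrites, directory listings in hot paths) is exactly where FUSE-over-object-storage tends to hurt.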
- Doing this at anything > 1k nodes is a pain in the butt. We decided to run many <100-node clusters rather than a few big ones.
- > While we don’t yet officially support 130K nodes, we're very encouraged by these findings. If your workloads require this level of scale, reach out to us to discuss your specific needs
Obviously this is a typical Google experiment, running a K8s cluster at 130K nodes, but if there is a company out there that "requires" this scale, I must question their architecture and their infrastructure costs.
But of course someone will always claim that they somehow need this sort of scale to run their enterprise app. But once again, let's remind the pre-revenue startups talking about scale before they hit PMF:
Unless you are ready to donate tens of billions of dollars yearly, you do not need this.
You are not Google.
- What business use case requires a single cluster with thousands of pods? Wouldn't having multiple clusters, each hosting a few namespaces, be a better architecture?
- View without needing to sign in: https://web.archive.org/web/20251124111136/https://cloud.goo...
- K8s clusters on VMs strike me as odd.
I see the appeal of K8s in dividing raw, stateful hardware to run multiple parallel workloads, but if you're dealing with stateless cloud VMs, why would you need K8s and its overhead when the VM hypervisor already gives you all that functionality?
And if you insist anyway, run a few big VMs rather than many small ones, since K8s overhead is per-node.
by moralestapia
0 subcomments
- Cute. I've done ~2 million (not k8s though, that trash would only slow me down).
by sandGorgon
1 subcomment
- Does anyone know the size at OpenAI? It used to run a 7,500-node cluster back in 2021: https://openai.com/index/scaling-kubernetes-to-7500-nodes/
by blamestross
0 subcomments
- I worked on DHTs in grad school. I still do a double take at how Google's and other companies' "computers dedicated to a task" numbers are missing two digits from what I expected. We have a lot of room left for expansion; we just have to relax centralized-management expectations.
- You could remove all references to AI/ML topics from this article and it would remain just as interesting and informative. I really hate that we let marketing people cram the buzzword of the day into what should be a purely technical discussion.
by blinding-streak
0 subcomments
- Imagine a Beowulf cluster of these
by supportengineer
0 subcomments
- The new mainframe.
- 130k nodes...cute...but can Google conquer the ultimate software engineering challenge they warn you about in CS school? A functional online signup flow?
- Sounds like hell. But I do really dislike Kubernetes: https://benhouston3d.com/blog/why-i-left-kubernetes-for-goog...