FRESH

Hacker News

Home

K8s with 1M nodes

190 points by denysvitali

by __turbobrew__

3 subcomments

I feel like etcd is one of the few use cases where Intel Optane would actually make sense. I build and run several bare metal clusters with over 10k nodes and etcd is by and large the biggest pain for us. Sometimes an etcd node just randomly stops accepting any proposals which halts the entire cluster until you can remove the bad etcd node.
From what I remember, GKE has implemented an etcd shim on top of spanner as a way to get around the scalability issues, but unfortunately for the rest of us who do not have spanner there aren’t any great options.
I feel like at a fundamental level that pod affinity, antiaffinity, and topology spreads are not compatible with very large clusters due to the complexity explosion in large clusters.
Another thing to consider is that the larger a cluster becomes, the larger the blast radius is. I have had clusters of 10k nodes spectacularly fail due to code bugs within k8s. Sharding total compute capacity compute capacity into multiple isolated k8s clusters reduces the likelihood that a software bug is going to take down everything as you can carefully upgrade only a single cell at a time with bake periods between each cell.

by kawsper

0 subcomment

That's really impressive and an interesting experiment.
I was about to say that Nomad did something similar, but that was 2 million Docker containers across 6100 nodes, https://www.hashicorp.com/en/c2m

by jeffinhat

0 subcomment

This is an awesome experiment and write up. I really appreciate the reproducibility.
I would like to see how moving to database that scales write throughput with replicas would behave, namely FoundationDB. I think this will require more than an intermediary like kine to be efficient, as the author illustrates the apisever does a fair bit of its own watching and keeping state. I also think there's benefit, at least for blast radius, to shard the server by api group or namespace.
I think years ago this would have been a non starter with the community, but given AWS has replaced etcd (or at least aspects) with their internal log service for their large cluster offering, I bet there's some appetite for making this interchangable and bringing and open source solution to market.
I share the authors viewpoint that for modern cloud based deployments, you're probably best avoiding it and relying on VMs being stable and recoverable. I think reliability does matter if you want to actually realize the "borg" value and run it on bare metal across a serious fleet. I haven't found the business justification to work on that though!

by skeptrune

0 subcomment

I read this as napkin math[1] for Kube and thoroughly enjoyed. You can only find the important numbers relative to performance and scaling by trying to accomplish some kind of goal. Benchmarks are mostly bikeshedding.
[1]: https://sirupsen.com/napkin

by ai-christianson

0 subcomment

Love this. There's no reason k8s shouldn't scale much further.

by wppick

1 subcomments

If you don't need the isolation of of k8s then don't forget about erlang, which is another option to scale up to 1 million functions. Obviously k8s containers (which are fundamentally just isolated processes) and erlang processes are not interchangeable things, but when thinking about needing in the order of millions of processes erlang is pretty good prior art

by rixed

1 subcomments

I don't get the point of benchmarking k8s without the guarantees of etcd. At some point, you are just competing with clusterssh.

by wb14123

1 subcomments

Instead of giving up the good guarantee of etcd, a better approach maybe grouping some nodes together to create a tree like structure with sub clusters.

by vebgen

0 subcomment

This is an absolutely incredible technical deep-dive. The section on replacing etcd with mem_etcd resonates with challenges we've been tackling at a much smaller scale building an AI agent system.
A few thoughts:
*On watch streams and caching*: Your observation about the B-Tree vs hashmap cache tradeoff is fascinating. We hit similar contention issues with our agent's context manager - switched from a simple dict to a more complex indexed structure for faster "list all relevant context" queries, but update performance suffered. The lesson about O(1) writes vs O(log n) reads being the wrong tradeoff for high-write workloads is universal.
*On optimistic concurrency for scheduling*: The scatter-gather scheduler design is elegant. We use a similar pattern for our dual-agent system (TARS planner + CASE executor) where both agents operate semi-independently but need coordination. Your point about "presuming no conflicts, but handling them when they occur" is exactly what we learned - pessimistic locking kills throughput far worse than occasional retries.
*The spicy take on durability*: "Most clusters don't need etcd's reliability" is provocative but I suspect correct for many use cases. For our Django development agent, we keep execution history in SQLite with WAL mode (no fsync), betting that if the host crashes, we'd rather rebuild from Git than wait on every write. Similar philosophy.
The mem_etcd implementation in Rust is particularly interesting - curious if you considered using FoundationDB's storage engine or something similar vs rolling your own? The per-prefix file approach is clever for reducing write amplification.
Fantastic work - this kind of empirical systems research is exactly what the community needs more of. The "what are the REAL limits" approach vs "conventional wisdom says X" is refreshing.

by rememberlenny

0 subcomment

People don’t realize how crucial Ben was in the forming of OpenAI as it is known today. This is an extremely underrated post.

by ktpsns

3 subcomments

Typical large scale high performance computing clusters are at a size of 10k nodes (for instance Jupiter and SuperMUC in Germany) [1]. These centers are quite remarkably big buildings. I wonder how much 1M node single k8s clusters there are in the world right now. Most likely at the hyperscalers.
[1] what is a node? Typically it is a synonym for "server". In some configurations HPC schedulers allow node sharing. Then we talk about order of 100k cores to be scheduled.

by halayli

1 subcomments

without publishing mem_etcd code, and without telling us what happens when one of the etcd/mem_etcd node dies to compare, this write up doesn't provide much information.

by up2isomorphism

2 subcomments

“Perhaps my spiciest take from this entire project: most clusters don’t actually need the level of reliability and durability that etcd provides.”
This assumption is completely out of touch, and is especially funny when the goal is to build an extra large cluster.